Monitoring and Tuning the Linux Networking Stack: Receiving Data
TL;DR(总结)
This blog post explains how computers running the Linux kernel receive packets, as well as how to monitor and tune each component of the networking stack as packets flow from the network toward userland programs.
本文将介绍运行 Linux 内核的计算机如何接收数据包,以及在数据包从网络流向用户态程序的过程中,如何监控和调整网络栈的各个组件。
UPDATE We’ve released the counterpart to this post: Monitoring and Tuning the Linux Networking Stack: Sending Data.
我们发布了本文的姊妹篇:监控和调优 Linux 网络栈:数据发送。
UPDATE Take a look at the Illustrated Guide to Monitoring and Tuning the Linux Networking Stack: Receiving Data, which adds some diagrams for the information presented below.
查看监控和调优 Linux 网络栈:数据接收的图文指南,其中为以下内容添加了一些图表。
It is impossible to tune or monitor the Linux networking stack without reading the source code of the kernel and having a deep understanding of what exactly is happening.
This blog post will hopefully serve as a reference to anyone looking to do this.
如果不阅读内核源代码并深入理解具体发生的情况,就无法对 Linux 网络栈进行监控或调优。希望这篇文章能为任何想要进行此项工作的人提供参考。
Special thanks(特别感谢)
Special thanks to the folks at Private Internet Access who hired us to research this information in conjunction with other network research and who have graciously allowed us to build upon the research and publish this information.
特别感谢 Private Internet Access 的工作人员,他们聘请我们进行此项信息研究,以及其他网络研究,并慷慨地允许我们在此研究基础上进行拓展,并发布这些信息。
The information presented here builds upon the work done for Private Internet Access, which was originally published as a 5 part series starting here.
本文中的信息是在为 Private Internet Access 所做工作的基础上构建的,该工作最初以五部分系列文章的形式发布,可从此处开始阅读。
General advice on monitoring and tuning the Linux networking stack(监控和调优 Linux 网络栈的一般建议)
The networking stack is complex and there is no one size fits all solution. If the performance and health of your networking is critical to you or your business, you will have no choice but to invest a considerable amount of time, effort, and money into understanding how the various parts of the system interact.
网络堆栈非常复杂,没有放之四海而皆准的解决方案。如果网络的性能和健康状况对您或您的企业至关重要,您将别无选择,只能投入大量的时间、精力和金钱来了解系统各部分是如何相互作用的。
Ideally, you should consider measuring packet drops at each layer of the network stack. That way you can determine and narrow down which component needs to be tuned.
理想情况下,您应该考虑测量网络堆栈每一层的数据包丢失情况。这样,您就可以确定并缩小需要调整的组件范围。
This is where, I think, many operators go off track: the assumption is made that a set of sysctl settings or /proc values can simply be reused wholesale. In some cases, perhaps, but it turns out that the entire system is so nuanced and intertwined that if you desire to have meaningful monitoring or tuning, you must strive to understand how the system functions at a deep level. Otherwise, you can simply use the default settings, which should be good enough until further optimization (and the required investment to deduce those settings) is necessary.
我认为,许多运维人员在此处误入歧途:他们假设可以直接复用一组 sysctl 设置或
/proc
值。在某些情况下可能如此，但事实是整个系统非常复杂且相互关联，如果你希望进行有效的监控或调优，就必须努力深入理解系统的运作机制。否则，你只需使用默认设置，这些设置在不需要进一步优化（以及推导这些设置所需的投资）之前应该已经足够好用了。
Many of the example settings provided in this blog post are used solely for illustrative purposes and are not a recommendation for or against a certain configuration or default setting. Before adjusting any setting, you should develop a frame of reference around what you need to be monitoring to notice a meaningful change.
本文中提供的许多示例设置仅用于说明目的,并非对特定配置或默认设置的推荐或反对。在调整任何设置之前,您应该围绕需要监控的内容建立一个参考框架,以便注意到有意义的变化。
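For example, before changing any networking sysctl it is worth recording what the system is currently running with. The parameter below (net.core.netdev_max_backlog, a receive-path knob whose default is 1000 on many kernels) is only an illustration and may or may not matter for your workload:
$ sysctl net.core.netdev_max_backlog
net.core.netdev_max_backlog = 1000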
Adjusting networking settings while connected to the machine over a network is dangerous; you could very easily lock yourself out or completely take out your networking. Do not adjust these settings on production machines; instead make adjustments on new machines and rotate them into production, if possible.
通过网络连接到机器时调整网络设置是很危险的,您很可能会将自己锁定在外,或者完全中断网络连接。请勿在生产机器上调整这些设置;如果可能的话,应在新机器上进行调整,然后将其轮换到生产环境中。
Overview(概述)
For reference, you may want to have a copy of the device data sheet handy. This post will examine the Intel I350 Ethernet controller, controlled by the
igb
device driver. You can find that data sheet (warning: LARGE PDF) here for your reference.为便于参考,您可能需要手头备有一份设备数据表。本文将研究由
igb
设备驱动程序控制的英特尔 I350 以太网控制器。您可以在此处找到该数据表（警告：大型 PDF 文件）以供参考。
The high level path a packet takes from arrival to socket receive buffer is as follows:
- Driver is loaded and initialized.
- Packet arrives at the NIC from the network.
- Packet is copied (via DMA) to a ring buffer in kernel memory.
- Hardware interrupt is generated to let the system know a packet is in memory.
- Driver calls into NAPI to start a poll loop if one was not running already.
ksoftirqd
processes run on each CPU on the system. They are registered at boot time. Theksoftirqd
processes pull packets off the ring buffer by calling the NAPIpoll
function that the device driver registered during initialization.
- Memory regions in the ring buffer that have had network data written to them are unmapped.
- Data that was DMA’d into memory is passed up the networking layer as an ‘skb’ for more processing.
- Incoming network data frames are distributed among multiple CPUs if packet steering is enabled or if the NIC has multiple receive queues.
- Network data frames are handed to the protocol layers from the queues.
- Protocol layers process data.
- Data is added to receive buffers attached to sockets by protocol layers.
数据包从到达至进入套接字接收缓冲区的大致路径如下:
- 驱动程序被加载并初始化。
- 数据包从网络到达网络接口卡(NIC)。
- 数据包通过直接内存访问(DMA)被复制到内核内存中的环形缓冲区。
- 生成硬件中断,通知系统内存中有数据包。
- 如果轮询循环尚未运行,驱动程序会调用 NAPI 启动轮询循环。
ksoftirqd
进程在系统中的每个 CPU 上运行,它们在系统启动时注册。ksoftirqd
进程通过调用设备驱动程序在初始化期间注册的 NAPIpoll
函数,从环形缓冲区中取出数据包。
- 环形缓冲区中写入了网络数据的内存区域被取消映射。
- 被 DMA 写入内存的数据将作为 “skb ”传递到网络层进行进一步处理。
- 如果启用了数据包导向功能,或者 NIC 有多个接收队列,则传入的网络数据帧会在多个 CPU 之间分配。
- 网络数据帧从队列传递到协议层。
- 协议层处理数据。
- 数据由协议层添加到与套接字关联的接收缓冲区中。
This entire flow will be examined in detail in the following sections.
The protocol layers examined below are the IP and UDP protocol layers. Much of the information presented will serve as a reference for other protocol layers, as well.
以下部分将详细研究整个流程。下面所研究的协议层是 IP 和 UDP 协议层,本文中呈现的许多信息也可作为其他协议层的参考。
Detailed Look(详细分析)
This blog post will be examining the Linux kernel version 3.13.0 with links to code on GitHub and code snippets throughout this post.
本文将研究 Linux 内核版本 3.13.0,并在文中提供 GitHub 代码链接和代码片段。
Understanding exactly how packets are received in the Linux kernel is very involved. We’ll need to closely examine and understand how a network driver works, so that parts of the network stack later are more clear.
要确切理解 Linux 内核中数据包的接收方式,涉及的内容非常多。我们需要仔细研究并理解网络驱动程序的工作原理,这样后续网络栈的部分内容会更加清晰。
This blog post will look at the
igb
network driver. This driver is used for a relatively common server NIC, the Intel Ethernet Controller I350. So, let’s start by understanding how the igb
network driver works.本文将研究
igb
网络驱动程序,该驱动程序用于相对常见的服务器 NIC—— 英特尔以太网控制器 I350。那么,让我们从了解igb
网络驱动程序的工作原理开始。
Network Device Driver（网络设备驱动程序）
Initialization(初始化)
A driver registers an initialization function which is called by the kernel when the driver is loaded. This function is registered by using the
module_init
macro.驱动程序会注册一个初始化函数,在驱动程序加载时由内核调用。这个函数通过使用
module_init
宏进行注册。The
igb
initialization function (igb_init_module
) and its registration with module_init
can be found in drivers/net/ethernet/intel/igb/igb_main.c.igb
初始化函数(igb_init_module
)及其通过module_init
的注册,可以在drivers/net/ethernet/intel/igb/igb_main.c
中找到。Both are fairly straightforward:
两者都相当简单:
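The original post embeds the relevant snippet from igb_main.c at this point. A lightly abridged reproduction of what that registration looks like in the 3.13-era driver is shown below (ancillary details are elided); the important part is simply that module_init points the kernel at igb_init_module, which in turn calls pci_register_driver:
/* Abridged from drivers/net/ethernet/intel/igb/igb_main.c:
 * igb_init_module is the first routine called when the driver is
 * loaded. All it really does is register with the PCI subsystem.
 */
static int __init igb_init_module(void)
{
	int ret;

	pr_info("%s - version %s\n", igb_driver_string, igb_driver_version);
	pr_info("%s\n", igb_copyright);

	/* ... */

	ret = pci_register_driver(&igb_driver);
	return ret;
}

module_init(igb_init_module);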
The bulk of the work to initialize the device happens with the call to
pci_register_driver
as we’ll see next.正如我们接下来将看到的,初始化设备的大部分工作是通过调用
pci_register_driver
完成的。
PCI initialization（PCI 初始化）
The Intel I350 network card is a PCI express device.
英特尔 I350 网卡是一种 PCI Express 设备。
PCI devices identify themselves with a series of registers in the PCI Configuration Space.
PCI 设备通过 PCI 配置空间中的一系列寄存器来标识自己。
When a device driver is compiled, a macro named
MODULE_DEVICE_TABLE
(from include/module.h
) is used to export a table of PCI device IDs identifying devices that the device driver can control. The table is also registered as part of a structure, as we’ll see shortly.在编译设备驱动程序时,会使用一个名为
MODULE_DEVICE_TABLE
(来自include/module.h
)的宏,导出一个 PCI 设备 ID 表,用于识别设备驱动程序可以控制的设备。该表也作为一个结构的一部分进行注册,我们很快就会看到。The kernel uses this table to determine which device driver to load to control the device.
内核使用这个表来确定加载哪个设备驱动程序来控制设备。
That’s how the OS can figure out which devices are connected to the system and which driver should be used to talk to the device.
这就是操作系统如何确定哪些设备连接到系统,以及应该使用哪个驱动程序与设备进行通信的方式。
This table and the PCI device IDs for the
igb
driver can be found in drivers/net/ethernet/intel/igb/igb_main.c
and drivers/net/ethernet/intel/igb/e1000_hw.h
, respectively:igb
驱动程序的这个表和 PCI 设备 ID 分别可以在drivers/net/ethernet/intel/igb/igb_main.c
和drivers/net/ethernet/intel/igb/e1000_hw.h
中找到：
static DEFINE_PCI_DEVICE_TABLE(igb_pci_tbl) = {
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_1GBPS) },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_SGMII) },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_2_5GBPS) },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I211_COPPER), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_FIBER), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SGMII), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER_FLASHLESS), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES_FLASHLESS), board_82575 },
	/* ... */
};
MODULE_DEVICE_TABLE(pci, igb_pci_tbl);
As seen in the previous section,
pci_register_driver
is called in the driver’s initialization function.如前所述,在驱动程序的初始化函数中会调用
pci_register_driver
。This function registers a structure of pointers. Most of the pointers are function pointers, but the PCI device ID table is also registered. The kernel uses the functions registered by the driver to bring the PCI device up.
该函数注册一个指针结构。大部分指针是函数指针,但 PCI 设备 ID 表也被注册。内核使用驱动程序注册的函数来启动 PCI 设备。
static struct pci_driver igb_driver = {
	.name     = igb_driver_name,
	.id_table = igb_pci_tbl,
	.probe    = igb_probe,
	.remove   = igb_remove,
	/* ... */
};
PCI probe(PCI 探测)
Once a device has been identified by its PCI IDs, the kernel can then select the proper driver to use to control the device. Each PCI driver registers a probe function with the PCI system in the kernel. The kernel calls this function for devices which have not yet been claimed by a device driver. Once a device is claimed, other drivers will not be asked about the device. Most drivers have a lot of code that runs to get the device ready for use. The exact things done vary from driver to driver.
通过 PCI ID 识别设备后,内核就可以选择适当的驱动程序来控制设备。每个 PCI 驱动程序都会在内核中的 PCI 系统注册一个探测函数。
对于尚未被设备驱动程序认领的设备,内核会调用该函数。一旦设备被认领,其他驱动程序就不会再询问有关该设备的信息。
大多数驱动程序都有大量代码来使设备准备好投入使用，具体操作因驱动程序而异。
Some typical operations to perform include:
- Enabling the PCI device.
- Requesting memory ranges and IO ports.
- Setting the DMA mask.
- The ethtool (described more below) functions the driver supports are registered.
- Any watchdog tasks needed (for example, e1000e has a watchdog task to check if the hardware is hung).
- Other device specific stuff like workarounds or dealing with hardware specific quirks or similar.
- The creation, initialization, and registration of a
struct net_device_ops
structure. This structure contains function pointers to the various functions needed for opening the device, sending data to the network, setting the MAC address, and more.
- The creation, initialization, and registration of a high level
struct net_device
which represents a network device.
一些典型的操作包括:
- 启用 PCI 设备。
- 请求内存范围和 I/O 端口。
- 设置 DMA 掩码。
- 注册驱动程序支持的 ethtool(下面会详细介绍)函数。
- 任何需要的看门狗任务(例如,e1000e 有一个看门狗任务,用于检查硬件是否挂起)。
- 其他特定于设备的操作,如解决硬件特定的问题或处理类似的硬件特性。
- 创建、初始化和注册一个
struct net_device_ops
结构,该结构包含指向打开设备、向网络发送数据、设置 MAC 地址等各种所需函数的指针。
- 创建、初始化和注册一个高级的
struct net_device
,它代表一个网络设备。
Let’s take a quick look at some of these operations in the
igb
driver in the function igb_probe
.让我们快速查看一下
igb
驱动程序中igb_probe
函数中的一些操作。
A peek into PCI initialization（PCI 初始化窥探）
The following code from the
igb_probe
function does some basic PCI configuration. From drivers/net/ethernet/intel/igb/igb_main.c:以下来自
igb_probe
函数的代码进行了一些基本的 PCI 配置。在 drivers/net/ethernet/intel/igb/igb_main.c
中：
err = pci_enable_device_mem(pdev);

/* ... */

err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));

/* ... */

err = pci_request_selected_regions(pdev,
				   pci_select_bars(pdev, IORESOURCE_MEM),
				   igb_driver_name);

pci_enable_pcie_error_reporting(pdev);

pci_set_master(pdev);
pci_save_state(pdev);
First, the device is initialized with
pci_enable_device_mem
. This will wake up the device if it is suspended, enable memory resources, and more.首先,使用
pci_enable_device_mem
初始化设备。这将唤醒处于挂起状态的设备,启用内存资源等。Next, the DMA mask will be set. This device can read and write to 64bit memory addresses, so
dma_set_mask_and_coherent
is called with DMA_BIT_MASK(64)
.接下来,设置 DMA 掩码。 由于此设备可以读写 64 位内存地址,因此调用
dma_set_mask_and_coherent
并传入DMA_BIT_MASK(64)
。Memory regions will be reserved with a call to
pci_request_selected_regions
, PCI Express Advanced Error Reporting is enabled (if the PCI AER driver is loaded), DMA is enabled with a call to pci_set_master
, and the PCI configuration space is saved with a call to pci_save_state
.通过调用
pci_request_selected_regions
预留内存区域,启用 PCI Express 高级错误报告(如果加载了 PCI AER 驱动程序),通过调用pci_set_master
启用 DMA,并通过调用pci_save_state
保存 PCI 配置空间。Phew.
More Linux PCI driver information(更多 Linux PCI 驱动程序信息)
Going into the full explanation of how PCI devices work is beyond the scope of this post, but this excellent talk, this wiki, and this text file from the linux kernel are excellent resources.
Network device initialization(网络设备初始化)
The
igb_probe
function does some important network device initialization. In addition to the PCI specific work, it will do more general networking and network device work:
- The
struct net_device_ops
is registered.
ethtool
operations are registered.
- The default MAC address is obtained from the NIC.
net_device
feature flags are set.
- And lots more.
igb_probe
函数进行了一些重要的网络设备初始化工作。除了特定于 PCI 的工作外，它还会进行更通用的网络和网络设备相关工作：
- 注册
struct net_device_ops
。
- 注册
ethtool
操作。
- 从 NIC 获取默认 MAC 地址。
- 设置
net_device
功能标志。
- 还有很多其他工作。
Let’s take a look at each of these as they will be interesting later.
让我们逐个查看这些内容,因为它们在后面会很重要。
struct net_device_ops
The
struct net_device_ops
contains function pointers to lots of important operations that the network subsystem needs to control the device. We’ll be mentioning this structure many times throughout the rest of this post.struct net_device_ops
包含指向网络子系统控制设备所需的许多重要操作的函数指针。在本文的其余部分,我们会多次提到这个结构。This
net_device_ops
structure is attached to a struct net_device
in igb_probe
. From drivers/net/ethernet/intel/igb/igb_main.c)在
igb_probe
中,这个net_device_ops
结构被附加到struct net_device
上。static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent) { /* ... */ netdev->netdev_ops = &igb_netdev_ops;
And the functions that this
net_device_ops
structure holds pointers to are set in the same file. From drivers/net/ethernet/intel/igb/igb_main.c:并且这个
net_device_ops
结构所指向的函数在同一文件中设置。在drivers/net/ethernet/intel/igb/igb_main.c
中：
static const struct net_device_ops igb_netdev_ops = {
	.ndo_open            = igb_open,
	.ndo_stop            = igb_close,
	.ndo_start_xmit      = igb_xmit_frame,
	.ndo_get_stats64     = igb_get_stats64,
	.ndo_set_rx_mode     = igb_set_rx_mode,
	.ndo_set_mac_address = igb_set_mac,
	.ndo_change_mtu      = igb_change_mtu,
	.ndo_do_ioctl        = igb_ioctl,
	/* ... */
As you can see, there are several interesting fields in this
struct
like ndo_open
, ndo_stop
, ndo_start_xmit
, and ndo_get_stats64
which hold the addresses of functions implemented by the igb
driver.如您所见,这个
struct
中有几个有趣的字段,如ndo_open
、ndo_stop
、ndo_start_xmit
和ndo_get_stats64
,它们保存了由igb
驱动程序实现的函数地址。We’ll be looking at some of these in more detail later.
我们稍后会更详细地查看其中一些内容。
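To make the role of these function pointers concrete, here is a simplified sketch of how the core networking code calls through them when an interface is brought up. This is not the kernel's actual implementation; the real logic lives in __dev_open in net/core/dev.c and includes additional checks and notifier calls omitted here:
/* Simplified sketch only: the kernel reaches the driver through the
 * registered net_device_ops pointers rather than calling driver
 * functions directly. See __dev_open() in net/core/dev.c for the
 * real code.
 */
static int sketch_dev_open(struct net_device *dev)
{
	const struct net_device_ops *ops = dev->netdev_ops;
	int ret = 0;

	if (ops->ndo_open)
		ret = ops->ndo_open(dev);  /* for the igb driver, this is igb_open */

	return ret;
}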
ethtool
registration(ethtool 注册)
ethtool
is a command line program you can use to get and set various driver and hardware options. You can install it on Ubuntu by running apt-get install ethtool
.ethtool
是一个命令行程序,可用于获取和设置各种驱动程序和硬件选项。您可以在 Ubuntu 上通过运行sudo apt-get install ethtool
来安装它。A common use of
ethtool
is to gather detailed statistics from network devices. Other ethtool
settings of interest will be described later.ethtool
的一个常见用途是从网络设备收集详细的统计信息,后面将介绍其他值得关注的ethtool
设置。The
ethtool
program talks to device drivers by using the ioctl
system call. The device drivers register a series of functions that run for the ethtool
operations and the kernel provides the glue.ethtool
程序通过使用ioctl
系统调用与设备驱动程序进行通信。设备驱动程序会注册一系列用于ethtool
操作的函数,内核则提供连接机制。When an
ioctl
call is made from ethtool
, the kernel finds the ethtool
structure registered by the appropriate driver and executes the functions registered. The driver’s ethtool
function implementation can do anything from change a simple software flag in the driver to adjusting how the actual NIC hardware works by writing register values to the device.当从
ethtool
发出ioctl
调用时,内核会找到由相应驱动程序注册的ethtool
结构,并执行注册的函数。驱动程序的ethtool
函数实现可以执行各种操作,从更改驱动程序中的简单软件标志,到通过向设备写入寄存器值来调整实际 NIC 硬件的工作方式。The
igb
driver registers its ethtool
operations in igb_probe
by calling igb_set_ethtool_ops
:igb
驱动程序在igb_probe
中通过调用igb_set_ethtool_ops
来注册其ethtool
操作:static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent) { /* ... */ igb_set_ethtool_ops(netdev);
All of the
igb
driver’s ethtool
code can be found in the file drivers/net/ethernet/intel/igb/igb_ethtool.c
along with the igb_set_ethtool_ops
function.igb
驱动程序的所有ethtool
代码以及igb_set_ethtool_ops
函数都可以在drivers/net/ethernet/intel/igb/igb_ethtool.c
文件中找到。void igb_set_ethtool_ops(struct net_device *netdev) { SET_ETHTOOL_OPS(netdev, &igb_ethtool_ops); }
Above that, you can find the
igb_ethtool_ops
structure with the ethtool
functions the igb
driver supports set to the appropriate fields.在上面的代码中,您可以找到
igb_ethtool_ops
结构,其中igb
驱动程序支持的ethtool
函数被设置到相应的字段中。static const struct ethtool_ops igb_ethtool_ops = { .get_settings = igb_get_settings, .set_settings = igb_set_settings, .get_drvinfo = igb_get_drvinfo, .get_regs_len = igb_get_regs_len, .get_regs = igb_get_regs, /* ... */
It is up to the individual drivers to determine which
ethtool
functions are relevant and which should be implemented. Not all drivers implement all ethtool
functions, unfortunately.由各个驱动程序自行决定哪些
ethtool
函数是相关的,以及应该实现哪些函数。遗憾的是,并非所有驱动程序都实现了所有的ethtool
函数。One interesting
ethtool
function is get_ethtool_stats
, which (if implemented) produces detailed statistics counters that are tracked either in software in the driver or via the device itself.一个有趣的
ethtool
函数是get_ethtool_stats
,如果实现了这个函数,它会生成详细的统计计数器,这些计数器可以在驱动程序的软件中跟踪,也可以通过设备本身跟踪。The monitoring section below will show how to use
ethtool
to access these detailed statistics.下面的监控部分将展示如何使用
ethtool
来访问这些详细的统计信息。
IRQs（中断）
When a data frame is written to RAM via DMA, how does the NIC tell the rest of the system that data is ready to be processed?
当数据帧通过 DMA 写入内存时,NIC 如何告知系统的其他部分数据已准备好进行处理呢?
Traditionally, a NIC would generate an interrupt request (IRQ) indicating data had arrived. There are three common types of IRQs: MSI-X, MSI, and legacy IRQs. These will be touched upon shortly. A device generating an IRQ when data has been written to RAM via DMA is simple enough, but if large numbers of data frames arrive this can lead to a large number of IRQs being generated. The more IRQs that are generated, the less CPU time is available for higher level tasks like user processes.
传统上,NIC 会生成一个中断请求(IRQ),表明数据已到达。常见的 IRQ 有三种类型:MSI-X、MSI 和传统 IRQ,稍后会简要介绍。设备在数据通过 DMA 写入内存时生成 IRQ,这本身很简单,但如果大量数据帧到达,可能会导致生成大量的 IRQ。生成的 IRQ 越多,用于更高层次任务(如用户进程)的 CPU 时间就越少。
The New Api (NAPI) was created as a mechanism for reducing the number of IRQs generated by network devices on packet arrival. While NAPI reduces the number of IRQs, it cannot eliminate them completely. We’ll see why that is, exactly, in later sections.
新的 API(NAPI)作为一种减少网络设备在数据包到达时生成 IRQ 数量的机制应运而生。虽然 NAPI 减少了 IRQ 的数量,但它无法完全消除它们,我们将在后面的章节中详细了解原因。
NAPI
NAPI differs from the legacy method of harvesting data in several important ways. NAPI allows a device driver to register a
poll
function that the NAPI subsystem will call to harvest data frames.NAPI 在几个重要方面与传统的数据收集方法不同。NAPI 允许设备驱动程序注册一个
poll
函数,NAPI 子系统会调用这个函数来收集数据帧。The intended use of NAPI in network device drivers is as follows:
- NAPI is enabled by the driver, but is in the off position initially.
- A packet arrives and is DMA’d to memory by the NIC.
- An IRQ is generated by the NIC which triggers the IRQ handler in the driver.
- The driver wakes up the NAPI subsystem using a softirq (more on these later). This will begin harvesting packets by calling the driver’s registered
poll
function in a separate thread of execution.
- The driver should disable further IRQs from the NIC. This is done to allow the NAPI subsystem to process packets without interruption from the device.
- Once there is no more work to do, the NAPI subsystem is disabled and IRQs from the device are re-enabled.
- The process starts back at step 2.
网络设备驱动程序中使用 NAPI 的预期方式如下:
- 驱动程序启用 NAPI,但最初处于关闭状态。
- 一个数据包到达,并由 NIC 通过 DMA 传输到内存。
- NIC 生成一个 IRQ,触发驱动程序中的 IRQ 处理程序。
- 驱动程序使用软中断(稍后会详细介绍)唤醒 NAPI 子系统。这将通过在一个单独的执行线程中调用驱动程序注册的
poll
函数开始收集数据包。
- 驱动程序应禁用 NIC 的进一步 IRQ,这样做是为了让 NAPI 子系统在不受设备干扰的情况下处理数据包。
- 一旦没有更多工作要做,NAPI 子系统被禁用,设备的 IRQ 重新启用。
- 该过程从步骤 2 重新开始。
This method of gathering data frames has reduced overhead compared to the legacy method because many data frames can be consumed at a time without having to deal with processing each of them one IRQ at a time.
这种收集数据帧的方法与传统方法相比,减少了开销,因为可以一次处理多个数据帧,而无需为每个数据帧单独处理一个 IRQ。
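To illustrate steps 4 through 6 above, here is a minimal, hypothetical poll callback showing the shape most NAPI drivers follow. It is not the igb implementation (igb_poll is examined later), and the example_* names are placeholders for driver-specific code:
/* Hypothetical NAPI poll function. 'budget' caps how many packets may
 * be harvested in this invocation.
 */
static int example_poll(struct napi_struct *napi, int budget)
{
	struct example_q_vector *q_vector =
		container_of(napi, struct example_q_vector, napi);
	int work_done;

	/* Placeholder: pull up to 'budget' packets off the ring buffer. */
	work_done = example_clean_rx_irq(q_vector, budget);

	if (work_done < budget) {
		/* The ring is drained: leave polling mode and re-enable
		 * this queue's hardware interrupts.
		 */
		napi_complete(napi);
		example_enable_queue_irqs(q_vector);
	}

	/* Returning 'budget' signals that more work may remain. */
	return work_done;
}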
The device driver implements a
poll
function and registers it with NAPI by calling netif_napi_add
. When registering a NAPI poll
function with netif_napi_add
, the driver will also specify the weight
. Most of the drivers hardcode a value of 64
. This value and its meaning will be described in more detail below.设备驱动程序实现一个
poll
函数,并通过调用netif_napi_add
向 NAPI 注册它。在向netif_napi_add
注册 NAPI poll
函数时,驱动程序还会指定weight
,大多数驱动程序将其硬编码为64
。下面将更详细地描述这个值及其含义。Typically, drivers register their NAPI
poll
functions during driver initialization.通常,驱动程序在驱动程序初始化期间注册它们的 NAPI
poll
函数。NAPI initialization in the igb
driver(igb 驱动程序中的 NAPI 初始化)
The
igb
driver does this via a long call chain:igb_probe
callsigb_sw_init
.
igb_sw_init
callsigb_init_interrupt_scheme
.
igb_init_interrupt_scheme
callsigb_alloc_q_vectors
.
igb_alloc_q_vectors
callsigb_alloc_q_vector
.
igb_alloc_q_vector
callsnetif_napi_add
.
igb
驱动程序通过一个长调用链来完成此操作:igb_probe
调用igb_sw_init
。
igb_sw_init
调用igb_init_interrupt_scheme
。
igb_init_interrupt_scheme
调用igb_alloc_q_vectors
。
igb_alloc_q_vectors
调用igb_alloc_q_vector
。
igb_alloc_q_vector
调用netif_napi_add
。
This call trace results in a few high level things happening:
- If MSI-X is supported, it will be enabled with a call to
pci_enable_msix
.
- Various settings are computed and initialized; most notably the number of transmit and receive queues that the device and driver will use for sending and receiving packets.
igb_alloc_q_vector
is called once for every transmit and receive queue that will be created.
- Each call to
igb_alloc_q_vector
callsnetif_napi_add
to register apoll
function for that queue and an instance ofstruct napi_struct
that will be passed topoll
when called to harvest packets.
这个调用跟踪导致了一些高层次的事情发生:
- 如果支持 MSI-X,将通过调用
pci_enable_msix
启用它。
- 计算并初始化各种设置,最值得注意的是设备和驱动程序将用于发送和接收数据包的传输和接收队列的数量。
- 为每个将创建的传输和接收队列调用一次
igb_alloc_q_vector
。
- 每次对
igb_alloc_q_vector
的调用都会调用netif_napi_add
,为该队列注册一个poll
函数,以及一个struct napi_struct
实例,当调用该函数收集数据包时,这个实例将被传递给poll
函数。
Let’s take a look at
igb_alloc_q_vector
to see how the poll
callback and its private data are registered.让我们看一下
igb_alloc_q_vector
,了解poll
回调及其私有数据是如何注册的。在drivers/net/ethernet/intel/igb/igb_main.c
中：
static int igb_alloc_q_vector(struct igb_adapter *adapter,
			      int v_count, int v_idx,
			      int txr_count, int txr_idx,
			      int rxr_count, int rxr_idx)
{
	/* ... */

	/* allocate q_vector and rings */
	q_vector = kzalloc(size, GFP_KERNEL);
	if (!q_vector)
		return -ENOMEM;

	/* initialize NAPI */
	netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);

	/* ... */
The above code allocates memory for a receive queue and registers the function
igb_poll
with the NAPI subsystem. It provides a reference to the struct napi_struct
associated with this newly created RX queue (&q_vector->napi
above). This will be passed into igb_poll
when called by the NAPI subsystem when it comes time to harvest packets from this RX queue.上面的代码为接收队列分配内存,并向 NAPI 子系统注册
igb_poll
函数。它提供了一个指向与这个新创建的 RX 队列相关联的struct napi_struct
(上面的&q_vector->napi
)的引用。当 NAPI 子系统需要从这个 RX 队列收集数据包时,这个引用将被传递给igb_poll
函数。This will be important later when we examine the flow of data from drivers up the network stack.
当我们稍后检查从驱动程序到网络栈的数据流动时,这一点将很重要。
Bringing a network device up(启用网络设备)
Recall the
net_device_ops
structure we saw earlier which registered a set of functions for bringing the network device up, transmitting packets, setting the MAC address, etc.回想一下我们之前看到的
net_device_ops
结构,它注册了一组用于启用网络设备、传输数据包、设置 MAC 地址等的函数。When a network device is brought up (for example, with
ifconfig eth0 up
), the function attached to the ndo_open
field of the net_device_ops
structure is called.当启用网络设备时(例如,使用
ifconfig eth0 up
命令),会调用net_device_ops
结构中ndo_open
字段所关联的函数。The
ndo_open
function will typically do things like:- Allocate RX and TX queue memory
- Enable NAPI
- Register an interrupt handler
- Enable hardware interrupts
- And more.
ndo_open
函数通常会执行以下操作:- 分配 RX 和 TX 队列内存。
- 启用 NAPI。
- 注册一个中断处理程序。
- 启用硬件中断。
- 还有更多操作。
In the case of the
igb
driver, the function attached to the ndo_open
field of the net_device_ops
structure is called igb_open
.在
igb
驱动程序的情况下,net_device_ops
结构中ndo_open
字段所关联的函数名为igb_open
。
Preparing to receive data from the network（准备从网络接收数据）
Most NICs you’ll find today will use DMA to write data directly into RAM where the OS can retrieve the data for processing. The data structure most NICs use for this purpose resembles a queue built on circular buffer (or a ring buffer).
目前,大多数网卡都使用 DMA 将数据直接写入 RAM,操作系统可以在 RAM 中读取数据进行处理。大多数 NIC 为此使用的数据结构类似于建立在圆形缓冲区(或环形缓冲区)上的队列。
In order to do this, the device driver must work with the OS to reserve a region of memory that the NIC hardware can use. Once this region is reserved, the hardware is informed of its location and incoming data will be written to RAM where it will later be picked up and processed by the networking subsystem.
为了实现这一点,设备驱动程序必须与操作系统合作,保留一块 NIC 硬件可以使用的内存区域。一旦保留了这个区域,硬件就会得知其位置,传入的数据将被写入内存,稍后将由网络子系统提取并处理。
This seems simple enough, but what if the packet rate was high enough that a single CPU was not able to properly process all incoming packets? The data structure is built on a fixed length region of memory, so incoming packets would be dropped.
这看起来很简单,但如果数据包速率足够高,以至于单个 CPU 无法正确处理所有传入的数据包会怎样呢?由于数据结构是基于固定长度的内存区域构建的,传入的数据包可能会被丢弃。
This is where something known as Receive Side Scaling (RSS) or multiqueue can help.
这就是所谓的接收端缩放(RSS)或多队列技术可以发挥作用的地方。
Some devices have the ability to write incoming packets to several different regions of RAM simultaneously; each region is a separate queue. This allows the OS to use multiple CPUs to process incoming data in parallel, starting at the hardware level. This feature is not supported by all NICs.
有些设备可以将接收到的数据包同时写入 RAM 的多个不同区域;每个区域都是一个单独的队列。这样,操作系统就可以从硬件层面开始,使用多个 CPU 并行处理传入数据。并非所有 NIC 都支持此功能。
The Intel I350 NIC does support multiple queues. We can see evidence of this in the
igb
driver. One of the first things the igb
driver does when it is brought up is call a function named igb_setup_all_rx_resources
. This function calls another function, igb_setup_rx_resources
, once for each RX queue to arrange for DMA-able memory where the device will write incoming data.英特尔 I350 NIC 支持多个队列。我们可以在
igb
驱动程序中看到这一点的证据。igb
驱动程序启动时首先要做的事情之一就是调用一个名为igb_setup_all_rx_resources
的函数。这个函数会为每个 RX 队列调用另一个函数igb_setup_rx_resources
一次,为设备写入传入数据安排可进行 DMA 操作的内存。If you are curious how exactly this works, please see the Linux kernel’s DMA API HOWTO.
如果您对具体的工作方式感到好奇,请查看 Linux 内核的DMA API 指南。
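As a rough illustration of what a function like igb_setup_rx_resources arranges (this is not the actual igb code, and the descriptor layout below is invented for the example), reserving DMA-able memory for a receive descriptor ring generally looks something like this:
/* Hypothetical RX ring setup: reserve memory the NIC can DMA into and
 * remember the bus address that will be programmed into the hardware.
 */
struct example_rx_ring {
	void *desc;          /* descriptor array shared with the NIC */
	dma_addr_t dma;      /* bus address handed to the device */
	unsigned int count;  /* number of descriptors */
	unsigned int size;   /* total size in bytes */
};

static int example_setup_rx_ring(struct device *dev, struct example_rx_ring *ring)
{
	ring->size = ring->count * 16;  /* made-up descriptor size */
	ring->desc = dma_alloc_coherent(dev, ring->size, &ring->dma, GFP_KERNEL);
	if (!ring->desc)
		return -ENOMEM;

	/* ring->dma would then be written to a device register so the NIC
	 * knows where the descriptor ring lives in memory.
	 */
	return 0;
}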
It turns out the number and size of the RX queues can be tuned by using
ethtool
. Tuning these values can have a noticeable impact on the number of frames which are processed vs the number of frames which are dropped.事实证明,RX 队列的数量和大小可以使用
ethtool
进行调整。调整这些值对处理的帧数与丢弃的帧数会有显著影响。The NIC uses a hash function on the packet header fields (like source, destination, port, etc) to determine which RX queue the data should be directed to.
NIC 使用数据包头部字段(如源地址、目的地址、端口等)上的哈希函数来确定数据应该被定向到哪个 RX 队列。
Some NICs let you adjust the weight of the RX queues, so you can send more traffic to specific queues.
一些 NIC 允许您调整 RX 队列的权重,这样您就可以将更多流量发送到特定的队列。
Fewer NICs let you adjust this hash function itself. If you can adjust the hash function, you can send certain flows to specific RX queues for processing or even drop the packets at the hardware level, if desired.
更少的 NIC 允许您调整这个哈希函数本身。如果您可以调整哈希函数,您可以将特定的流量发送到特定的 RX 队列进行处理,甚至可以根据需要在硬件级别丢弃数据包。
We’ll take a look at how to tune these settings shortly.
我们稍后将了解如何调整这些设置。
Enable NAPI(启用 NAPI)
When a network device is brought up, a driver will usually enable NAPI.
当网络设备启动时,驱动程序通常会启用 NAPI。
We saw earlier how drivers register
poll
functions with NAPI, but NAPI is not usually enabled until the device is brought up.我们之前看到了驱动程序如何向 NAPI 注册
poll
函数,但 NAPI 通常在设备启动之前不会被启用。Enabling NAPI is relatively straight forward. A call to
napi_enable
will flip a bit in the struct napi_struct
to indicate that it is now enabled. As mentioned above, while NAPI will be enabled it will be in the off position.启用 NAPI 相对简单,调用
napi_enable
会在struct napi_struct
中翻转一个位,以表明它现在已启用。如前所述,虽然 NAPI 会被启用,但它将处于关闭位置。In the case of the
igb
driver, NAPI is enabled for each q_vector
that was initialized when the driver was loaded or when the queue count or size are changed with ethtool
.对于
igb
驱动程序,NAPI 会在加载驱动程序时或使用 ethtool
更改队列数或队列大小时为每个初始化的 q_vector
启用。for (i = 0; i < adapter->num_q_vectors; i++) napi_enable(&(adapter->q_vector[i]->napi));
Register an interrupt handler(注册中断处理程序)
After enabling NAPI, the next step is to register an interrupt handler. There are different methods a device can use to signal an interrupt: MSI-X, MSI, and legacy interrupts. As such, the code differs from device to device depending on what the supported interrupt methods are for a particular piece of hardware.
启用 NAPI 后,下一步是注册一个中断处理程序。设备可以使用不同的方法来发出中断信号:MSI-X、MSI 和传统中断。因此,根据特定硬件支持的中断方法,不同设备的代码也有所不同。
The driver must determine which method is supported by the device and register the appropriate handler function that will execute when the interrupt is received.
驱动程序必须确定设备支持哪种方法,并注册在接收到中断时将执行的适当处理程序函数。
Some drivers, like the
igb
driver, will try to register an interrupt handler with each method, falling back to the next untested method on failure.一些驱动程序,如
igb
驱动程序,会尝试使用每种方法注册一个中断处理程序,如果失败则回退到下一个未测试的方法。MSI-X interrupts are the preferred method, especially for NICs that support multiple RX queues. This is because each RX queue can have its own hardware interrupt assigned, which can then be handled by a specific CPU (with
irqbalance
or by modifying /proc/irq/IRQ_NUMBER/smp_affinity
). As we’ll see shortly, the CPU that handles the interrupt will be the CPU that processes the packet. In this way, arriving packets can be processed by separate CPUs from the hardware interrupt level up through the networking stack.MSI-X 中断是首选方法,特别是对于支持多个 RX 队列的 NIC。这是因为每个 RX 队列可以有自己的硬件中断分配,然后可以由特定的 CPU 处理(通过
irqbalance
或通过修改 /proc/irq/IRQ_NUMBER/smp_affinity
)。正如我们稍后将看到的,处理中断的 CPU 将是处理数据包的 CPU。通过这种方式,从硬件中断级别到网络栈,到达的数据包可以由不同的 CPU 进行处理。If MSI-X is unavailable, MSI still presents advantages over legacy interrupts and will be used by the driver if the device supports it. Read this useful wiki page for more information about MSI and MSI-X.
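For example, to steer a particular IRQ to CPU 2 you could write a CPU bitmask to its smp_affinity file (0x4 is binary 100, i.e. CPU 2). The IRQ number comes from /proc/interrupts, IRQ_NUMBER below is a placeholder, and irqbalance may need to be stopped or configured so it does not immediately overwrite the setting:
$ sudo sh -c 'echo 4 > /proc/irq/IRQ_NUMBER/smp_affinity'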
如果 MSI-X 不可用,MSI 仍然比传统中断具有优势,如果设备支持,驱动程序将使用它。有关 MSI 和 MSI-X 的更多信息,请阅读这个有用的维基页面。
In the
igb
driver, the functions igb_msix_ring
, igb_intr_msi
, igb_intr
are the interrupt handler methods for the MSI-X, MSI, and legacy interrupt modes, respectively.在
igb
驱动程序中,igb_msix_ring
、igb_intr_msi
、igb_intr
函数分别是 MSI-X、MSI 和传统中断模式的中断处理程序方法。You can find the code in the driver which attempts each interrupt method in drivers/net/ethernet/intel/igb/igb_main.c:
您可以在
drivers/net/ethernet/intel/igb/igb_main.c
中的驱动程序代码中找到尝试每种中断方法的代码:static int igb_request_irq(struct igb_adapter *adapter) { struct net_device *netdev = adapter->netdev; struct pci_dev *pdev = adapter->pdev; int err = 0; if (adapter->msix_entries) { err = igb_request_msix(adapter); if (!err) goto request_done; /* fall back to MSI */ /* ... */ } /* ... */ if (adapter->flags & IGB_FLAG_HAS_MSI) { err = request_irq(pdev->irq, igb_intr_msi, 0, netdev->name, adapter); if (!err) goto request_done; /* fall back to legacy interrupts */ /* ... */ } err = request_irq(pdev->irq, igb_intr, IRQF_SHARED, netdev->name, adapter); if (err) dev_err(&pdev->dev, "Error %d getting interrupt\n", err); request_done: return err; }
As you can see in the abbreviated code above, the driver first attempts to set an MSI-X interrupt handler with
igb_request_msix
, falling back to MSI on failure. Next, request_irq
is used to register igb_intr_msi
, the MSI interrupt handler. If this fails, the driver falls back to legacy interrupts. request_irq
is used again to register the legacy interrupt handler igb_intr
.正如您在上面的缩写代码中看到的,驱动程序首先尝试使用
igb_request_msix
设置 MSI-X 中断处理程序,如果失败则回退到 MSI。接下来,使用request_irq
注册igb_intr_msi
,即 MSI 中断处理程序。如果这也失败,驱动程序回退到传统中断,再次使用request_irq
注册传统中断处理程序igb_intr
。And this is how the
igb
driver registers a function that will be executed when the NIC raises an interrupt signaling that data has arrived and is ready for processing.这就是
igb
驱动程序注册一个函数的方式,当 NIC 发出中断信号表明数据已到达并准备好进行处理时,该函数将被执行。Enable Interrupts(启用中断)
At this point, almost everything is setup. The only thing left is to enable interrupts from the NIC and wait for data to arrive. Enabling interrupts is hardware specific, but the
igb
driver does this in __igb_open
by calling a helper function named igb_irq_enable
.此时,几乎所有设置都已完成。剩下的唯一事情是启用 NIC 的中断并等待数据到达。启用中断是特定于硬件的,但
igb
驱动程序在__igb_open
中通过调用一个名为igb_irq_enable
的辅助函数来完成此操作。Interrupts are enabled for this device by writing to registers:
通过向寄存器写入来为这个设备启用中断:
static void igb_irq_enable(struct igb_adapter *adapter)
{
	/* ... */
	wr32(E1000_IMS, IMS_ENABLE_MASK | E1000_IMS_DRSTA);
	wr32(E1000_IAM, IMS_ENABLE_MASK | E1000_IMS_DRSTA);
	/* ... */
}
The network device is now up(网络设备现已启动)
Drivers may do a few more things like start timers, work queues, or other hardware-specific setup. Once that is completed, the network device is up and ready for use.
驱动程序可能会执行一些其他操作,如启动定时器、工作队列或其他特定于硬件的设置。一旦完成这些操作,网络设备就启动并准备好使用了。
Let’s take a look at monitoring and tuning settings for network device drivers.
让我们来看看网络设备驱动程序的监控和调整设置。
Monitoring network devices(监控网络设备)
There are several different ways to monitor your network devices offering different levels of granularity and complexity. Let’s start with most granular and move to least granular.
有几种不同的方法可以监控网络设备,它们提供不同程度的粒度和复杂度。让我们从最细粒度的方法开始,逐步转向最粗粒度的方法。
Using ethtool -S
(使用 ethtool -S)
You can install
ethtool
on an Ubuntu system by running: sudo apt-get install ethtool
.您可以在 Ubuntu 系统上通过运行
sudo apt-get install ethtool
来安装ethtool
。Once it is installed, you can access the statistics by passing the
-S
flag along with the name of the network device you want statistics about.安装完成后,您可以通过传递
-S
标志以及您想要获取统计信息的网络设备名称来访问这些统计信息。Monitor detailed NIC device statistics (e.g., packet drops) with
ethtool -S
使用
ethtool -S
监控详细的 NIC 设备统计信息（例如，数据包丢弃情况）：
$ sudo ethtool -S eth0
NIC statistics:
     rx_packets: 597028087
     tx_packets: 5924278060
     rx_bytes: 112643393747
     tx_bytes: 990080156714
     rx_broadcast: 96
     tx_broadcast: 116
     rx_multicast: 20294528
     ....
Monitoring this data can be difficult. It is easy to obtain, but there is no standardization of the field values. Different drivers, or even different versions of the same driver might produce different field names that have the same meaning.
监控这些数据可能很困难。虽然数据很容易获取,但字段值没有标准化。不同的驱动程序,甚至同一驱动程序的不同版本,可能会产生具有相同含义但名称不同的字段。
You should look for values with “drop”, “buffer”, “miss”, etc in the label. Next, you will have to read your driver source. You’ll be able to determine which values are accounted for totally in software (e.g., incremented when there is no memory) and which values come directly from hardware via a register read. In the case of a register value, you should consult the data sheet for your hardware to determine what the meaning of the counter really is; many of the labels given via
ethtool
can be misleading.您应该查找标签中带有 “drop”(丢弃)、“buffer”(缓冲区)、“miss”(错过)等字样的值。 接下来,您必须阅读驱动程序源代码,这样您就能确定哪些值完全在软件中统计(例如,在没有内存时递增),哪些值是通过寄存器读取直接从硬件获取的。 对于寄存器值,您应该查阅硬件的数据表,以确定计数器的真正含义; 通过
ethtool
给出的许多标签可能会产生误导。
Using sysfs（使用 sysfs）
sysfs also provides a lot of statistics values, but they are slightly higher level than the direct NIC level stats provided.
sysfs 也提供了许多统计值,但它们比直接的 NIC 级统计信息稍微高级一些。
You can find the number of dropped incoming network data frames for, e.g. eth0 by using
cat
on a file.例如,您可以使用
cat
命令查看eth0
的传入网络数据帧的丢弃数量。Monitor higher level NIC statistics with sysfs.
使用 sysfs 监控更高级别的 NIC 统计信息:
$ cat /sys/class/net/eth0/statistics/rx_dropped 2
The counter values will be split into files like
collisions
, rx_dropped
, rx_errors
, rx_missed_errors
, etc.计数器值会被分割到像
collisions
(冲突)、rx_dropped
(接收丢弃)、rx_errors
(接收错误)、rx_missed_errors
(接收错过错误)等文件中。Unfortunately, it is up to the drivers to decide what the meaning of each field is, and thus, when to increment them and where the values come from. You may notice that some drivers count a certain type of error condition as a drop, but other drivers may count the same as a miss.
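To print every counter in that directory along with its name in one shot (assuming the interface is eth0), something like the following works:
$ grep -H . /sys/class/net/eth0/statistics/*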
不幸的是,每个字段的含义由驱动程序决定,因此,何时递增这些字段以及这些值的来源也由驱动程序决定。您可能会注意到,一些驱动程序将某种类型的错误情况计为丢弃,而其他驱动程序可能将相同情况计为错过。
If these values are critical to you, you will need to read your driver source to understand exactly what your driver thinks each of these values means.
如果这些值对您很关键,您需要阅读驱动程序源代码,以准确理解您的驱动程序对每个这些值的理解。
Using /proc/net/dev
(使用 /proc/net/dev)
An even higher level file is
/proc/net/dev
which provides high-level summary-esque information for each network adapter on the system.更高级别的文件是
/proc/net/dev
,它为系统上的每个网络适配器提供高级别的汇总信息。Monitor high level NIC statistics by reading
/proc/net/dev
.通过读取
/proc/net/dev
监控高级别的 NIC 统计信息：
$ cat /proc/net/dev
Inter-|   Receive                                                 |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast |bytes    packets errs drop fifo colls carrier compressed
 eth0: 110346752214 597737500    0    2    0     0          0  20963860 990024805984 6066582604    0    0    0     0       0          0
   lo: 428349463836 1579868535    0    0    0     0          0         0 428349463836 1579868535    0    0    0     0       0          0
This file shows a subset of the values you’ll find in the sysfs files mentioned above, but it may serve as a useful general reference.
这个文件显示了您在上面提到的 sysfs 文件中会找到的值的一个子集,但它可能是一个有用的通用参考。
The caveat mentioned above applies here, as well: if these values are important to you, you will still need to read your driver source to understand exactly when, where, and why they are incremented to ensure your understanding of an error, drop, or fifo are the same as your driver.
上面提到的注意事项在此处同样适用:如果这些值对您很重要,您仍然需要阅读驱动程序源代码,以准确了解它们何时、何地以及为何递增,以确保您对错误、丢包或 FIFO(先进先出队列)的理解与驱动程序一致。
Tuning network devices(调整网络设备)
Check the number of RX queues being used(检查正在使用的 RX 队列数量)
If your NIC and the device driver loaded on your system support RSS / multiqueue, you can usually adjust the number of RX queues (also called RX channels), by using
ethtool
.如果你的 NIC 和系统上加载的设备驱动程序支持 RSS(接收端缩放)/ 多队列功能,通常可以使用
ethtool
调整 RX 队列(也称为 RX 通道)的数量。Check the number of NIC receive queues with
ethtool
使用
ethtool
检查 NIC 接收队列的数量：
$ sudo ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:             0
TX:             0
Other:          0
Combined:       8
Current hardware settings:
RX:             0
TX:             0
Other:          0
Combined:       4
This output is displaying the pre-set maximums (enforced by the driver and the hardware) and the current settings.
此输出显示了预设的最大值(由驱动程序和硬件强制执行)和当前设置。
Note: not all device drivers will have support for this operation.
注意:并非所有设备驱动程序都支持此操作。
Error seen if your NIC doesn't support this operation.
如果你的 NIC 不支持此操作,会看到以下错误:
$ sudo ethtool -l eth0 Channel parameters for eth0: Cannot get device channel parameters : Operation not supported
This means that your driver has not implemented the ethtool
get_channels
operation. This could be because the NIC doesn’t support adjusting the number of queues, doesn’t support RSS / multiqueue, or your driver has not been updated to handle this feature.这意味着你的驱动程序未实现
ethtool
的get_channels
操作。这可能是因为 NIC 不支持调整队列数量、不支持 RSS / 多队列功能,或者你的驱动程序尚未更新以处理此功能。Adjusting the number of RX queues(调整 RX 队列数量)
Once you’ve found the current and maximum queue count, you can adjust the values by using
sudo ethtool -L
.找到当前和最大队列数量后,可以使用
sudo ethtool -L
调整这些值。Note: some devices and their drivers only support combined queues that are paired for transmit and receive, as in the example in the above section.
注意:一些设备及其驱动程序仅支持成对的发送和接收组合队列,如上面部分中的示例。
Set combined NIC transmit and receive queues to 8 with
ethtool -L
使用
ethtool -L
将 NIC 的组合发送和接收队列设置为 8:$ sudo ethtool -L eth0 combined 8
If your device and driver support individual settings for RX and TX and you’d like to change only the RX queue count to 8, you would run:
Set the number of NIC receive queues to 8 with
ethtool -L
.如果你的设备和驱动程序支持对 RX 和 TX 进行单独设置,并且你只想将 RX 队列数量更改为 8,可以运行:
$ sudo ethtool -L eth0 rx 8
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
注意:对于大多数驱动程序,进行这些更改会使网络接口先关闭再重新启动,与该接口的连接将被中断。不过,对于一次性更改而言,这可能影响不大。
Adjusting the size of the RX queues(调整 RX 队列大小)
Some NICs and their drivers also support adjusting the size of the RX queue. Exactly how this works is hardware specific, but luckily
ethtool
provides a generic way for users to adjust the size. Increasing the size of the RX queue can help prevent network data drops at the NIC during periods where large numbers of data frames are received. Data may still be dropped in software, though, and other tuning is required to reduce or eliminate drops completely.一些 NIC 及其驱动程序还支持调整 RX 队列的大小。具体的实现方式因硬件而异,但幸运的是,
ethtool
为用户提供了一种通用的调整大小的方法。增加 RX 队列的大小有助于防止在接收大量数据帧期间 NIC 丢弃网络数据。不过,数据仍可能在软件层面被丢弃,还需要进行其他调整以减少或完全消除丢包。Check current NIC queue sizes with
ethtool -g
使用
ethtool -g
检查当前 NIC 队列大小：
$ sudo ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             512
RX Mini:        0
RX Jumbo:       0
TX:             512
the above output indicates that the hardware supports up to 4096 receive and transmit descriptors, but it is currently only using 512.
上述输出表明,硬件最多支持 4096 个接收和传输描述符,但目前仅使用了 512 个。
Increase size of each RX queue to 4096 with
ethtool -G
使用
ethtool -G
将每个 RX 队列的大小增加到 4096:$ sudo ethtool -G eth0 rx 4096
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
注意:对于大多数驱动程序,进行这些更改会使网络接口先关闭再重新启动,与该接口的连接将被中断。不过,对于一次性更改而言,这可能影响不大。
Adjusting the processing weight of RX queues(调整 RX 队列的处理权重)
Some NICs support the ability to adjust the distribution of network data among the RX queues by setting a weight.
一些 NIC 支持通过设置权重来调整网络数据在 RX 队列之间的分配。
You can configure this if:
- Your NIC supports flow indirection.
- Your driver implements the
ethtool
functionsget_rxfh_indir_size
andget_rxfh_indir
.
- You are running a new enough version of
ethtool
that has support for the command line optionsx
andX
to show and set the indirection table, respectively.
如果满足以下条件,你可以配置此设置:
- 你的 NIC 支持流间接功能。
- 你的驱动程序实现了
ethtool
函数get_rxfh_indir_size
和get_rxfh_indir
。
- 你运行的
ethtool
版本足够新,支持命令行选项x
和X
,分别用于显示和设置间接表。
Check the RX flow indirection table with
ethtool -x
使用
ethtool -x
检查 RX 流间接表：
$ sudo ethtool -x eth0
RX flow hash indirection table for eth3 with 2 RX ring(s):
    0:      0     1     0     1     0     1     0     1
    8:      0     1     0     1     0     1     0     1
   16:      0     1     0     1     0     1     0     1
   24:      0     1     0     1     0     1     0     1
This output shows packet hash values on the left, with receive queue 0 and 1 listed. So, a packet which hashes to 2 will be delivered to receive queue 0, while a packet which hashes to 3 will be delivered to receive queue 1.
此输出在左侧显示数据包哈希值,右侧列出接收队列 0 和 1。因此,哈希值为 2 的数据包将被发送到接收队列 0,而哈希值为 3 的数据包将被发送到接收队列 1。
Example: spread processing evenly between first 2 RX queues
示例:在最初的 2 个 RX 队列之间平均分配处理任务
$ sudo ethtool -X eth0 equal 2
If you want to set custom weights to alter the number of packets which hit certain receive queues (and thus CPUs), you can specify those on the command line, as well:
如果你想设置自定义权重以改变到达特定接收队列(进而到达特定 CPU)的数据包数量,也可以在命令行中指定这些权重:
Set custom RX queue weights with
ethtool -X
使用
ethtool -X
设置自定义 RX 队列权重:$ sudo ethtool -X eth0 weight 6 2
The above command specifies a weight of 6 for rx queue 0 and 2 for rx queue 1, pushing much more data to be processed on queue 0.
上述命令为 rx 队列 0 指定权重为 6,为 rx 队列 1 指定权重为 2,从而使更多数据在队列 0 上进行处理。
Some NICs will also let you adjust the fields which are used in the hash algorithm, as we'll see now.
一些 NIC 还允许你调整哈希算法中使用的字段,我们现在来了解一下。
Adjusting the rx hash fields for network flows(调整网络流的 rx 哈希字段)
You can use
ethtool
to adjust the fields that will be used when computing a hash for use with RSS.你可以使用
ethtool
调整用于计算 RSS 哈希值时使用的字段。Check which fields are used for UDP RX flow hash with
ethtool -n
.使用
ethtool -n
检查用于 UDP RX 流哈希的字段:$ sudo ethtool -n eth0 rx-flow-hash udp4 UDP over IPV4 flows use these fields for computing Hash flow key: IP SA IP DA
For eth0, the fields that are used for computing a hash on UDP flows is the IPv4 source and destination addresses. Let’s include the source and destination ports:
对于 eth0,用于计算 UDP 流哈希值的字段是 IPv4 源地址和目的地址。让我们添加源端口和目的端口:
Set UDP RX flow hash fields with
ethtool -N
.使用
ethtool -N
设置 UDP RX 流哈希字段:$ sudo ethtool -N eth0 rx-flow-hash udp4 sdfn
The
sdfn
string is a bit cryptic; check the ethtool
man page for an explanation of each letter.sdfn
这个字符串有点晦涩难懂,有关每个字母的解释,请查看ethtool
的手册页。Adjusting the fields to take a hash on is useful, but
ntuple
filtering is even more useful for finer grained control over which flows will be handled by which RX queue.调整用于哈希计算的字段很有用,但
ntuple
过滤对于更精细地控制哪些流由哪个 RX 队列处理更为有用。ntuple filtering for steering network flows(ntuple 过滤以引导网络流)
Some NICs support a feature known as “ntuple filtering.” This feature allows the user to specify (via
ethtool
) a set of parameters to use to filter incoming network data in hardware and queue it to a particular RX queue. For example, the user can specify that TCP packets destined to a particular port should be sent to RX queue 1.一些 NIC 支持一种称为 “ntuple 过滤” 的功能。此功能允许用户通过
ethtool
指定一组参数,在硬件中对传入的网络数据进行过滤,并将其排队到特定的 RX 队列。例如,用户可以指定目标端口为特定端口的 TCP 数据包应发送到 RX 队列 1。On Intel NICs this feature is commonly known as Intel Ethernet Flow Director. Other NIC vendors may have other marketing names for this feature.
在英特尔 NIC 上,此功能通常称为英特尔以太网流导向器。其他 NIC 供应商可能对此功能有不同的营销名称。
As we’ll see later, ntuple filtering is a crucial component of another feature called Accelerated Receive Flow Steering (aRFS), which makes using ntuple much easier if your NIC supports it. aRFS will be covered later.
正如我们稍后将看到的,ntuple 过滤是另一个称为加速接收流导向(aRFS)的功能的关键组成部分,如果你的 NIC 支持 aRFS,它会使使用 ntuple 变得更加容易。稍后将介绍 aRFS。
This feature can be useful if the operational requirements of the system involve maximizing data locality with the hope of increasing CPU cache hit rates when processing network data. For example consider the following configuration for a webserver running on port 80:
如果系统的操作要求涉及最大化数据局部性,以期在处理网络数据时提高 CPU 缓存命中率,那么此功能会很有用。例如,考虑在端口 80 上运行的 Web 服务器的以下配置:
- A webserver running on port 80 is pinned to run on CPU 2.
- IRQs for an RX queue are assigned to be processed by CPU 2.
- TCP traffic destined to port 80 is ‘filtered’ with ntuple to CPU 2.
- All incoming traffic to port 80 is then processed by CPU 2 starting at data arrival to the userland program.
- Careful monitoring of the system including cache hit rates and networking stack latency will be needed to determine effectiveness.
- 在端口 80 上运行的 Web 服务器被绑定到 CPU 2 上运行。
- RX 队列的 IRQ 被分配由 CPU 2 处理。
- 目标端口为 80 的 TCP 流量通过 ntuple “过滤” 到 CPU 2。
- 从数据到达用户态程序开始,所有发往端口 80 的传入流量都由 CPU 2 处理。
- 需要仔细监控系统,包括缓存命中率和网络栈延迟,以确定其有效性。
As mentioned, ntuple filtering can be configured with
ethtool
, but first, you’ll need to ensure that this feature is enabled on your device.如前所述,可以使用
ethtool
配置 ntuple 过滤,但首先,你需要确保设备上启用了此功能。Check if ntuple filters are enabled with
ethtool -k
使用
ethtool -k
检查 ntuple 过滤器是否启用:$ sudo ethtool -k eth0 Offload parameters for eth0: ... ntuple-filters: off receive-hashing: on
As you can see,
ntuple-filters
are set to off on this device.如你所见,此设备上的
ntuple-filters
设置为关闭。Enable ntuple filters with
ethtool -K
使用
ethtool -K
启用 ntuple 过滤器:$ sudo ethtool -K eth0 ntuple on
Once you’ve enabled ntuple filters, or verified that it is enabled, you can check the existing ntuple rules by using
ethtool
:启用 ntuple 过滤器后,或者确认其已启用后,可以使用
ethtool
检查现有的 ntuple 规则:Check existing ntuple filters with
ethtool -u
使用
ethtool -u
检查现有的 ntuple 过滤器:$ sudo ethtool -u eth0 40 RX rings available Total 0 rules
As you can see, this device has no ntuple filter rules. You can add a rule by specifying it on the command line to
ethtool
. Let’s add a rule to direct all TCP traffic with a destination port of 80 to RX queue 2:如你所见,此设备没有 ntuple 过滤规则。你可以在命令行中向
ethtool
指定规则来添加一个。让我们添加一个规则,将所有目标端口为 80 的 TCP 流量定向到 RX 队列 2:Add ntuple filter to send TCP flows with destination port 80 to RX queue 2
$ sudo ethtool -U eth0 flow-type tcp4 dst-port 80 action 2
You can also use ntuple filtering to drop packets for particular flows at the hardware level. This can be useful for mitigating heavy incoming traffic from specific IP addresses. For more information about configuring ntuple filter rules, see the
ethtool
man page.你还可以使用 ntuple 过滤在硬件级别丢弃特定流的数据包。这对于减轻来自特定 IP 地址的大量传入流量很有用。有关配置 ntuple 过滤规则的更多信息,请查看
ethtool
的手册页。You can usually get statistics about the success (or failure) of your ntuple rules by checking values output from
ethtool -S [device name]
. For example, on Intel NICs, the statistics fdir_match
and fdir_miss
calculate the number of matches and misses for your ntuple filtering rules. Consult your device driver source and device data sheet for tracking down statistics counters (if available).通常,你可以通过检查
ethtool -S [设备名称]
输出的值来获取 ntuple 规则成功(或失败)的统计信息。例如,在英特尔 NIC 上,fdir_match
和fdir_miss
统计信息计算 ntuple 过滤规则的匹配次数和未匹配次数。查阅设备驱动程序源代码和设备数据表,以查找统计计数器(如果可用)。SoftIRQs(软中断)
Before examining the network stack, we’ll need to take a short detour to examine something in the Linux kernel called SoftIRQs.
在研究网络栈之前,我们需要先绕个小弯,研究一下 Linux 内核中一个叫做软中断(SoftIRQs)的东西。
What is a softirq?(什么是软中断?)
The softirq system in the Linux kernel is a mechanism for executing code outside of the context of an interrupt handler implemented in a driver. This system is important because hardware interrupts may be disabled during all or part of the execution of an interrupt handler. The longer interrupts are disabled, the greater chance that events may be missed. So, it is important to defer any long running actions outside of the interrupt handler so that it can complete as quickly as possible and re-enable interrupts from the device.
Linux 内核中的软中断系统是一种在驱动程序实现的中断处理程序上下文之外执行代码的机制。这个系统很重要,因为在中断处理程序执行的全部或部分时间内,硬件中断可能会被禁用。中断被禁用的时间越长,错过事件的可能性就越大。因此,将任何长时间运行的操作推迟到中断处理程序之外执行非常重要,这样中断处理程序就可以尽快完成,并重新启用设备的中断。
There are other mechanisms that can be used for deferring work in the kernel, but for the purposes of the networking stack, we’ll be looking at softirqs.
在内核中还有其他机制可用于推迟工作,但就网络栈而言,我们将关注软中断。
The softirq system can be imagined as a series of kernel threads (one per CPU) that run handler functions which have been registered for different softirq events. If you’ve ever looked at top and seen
ksoftirqd/0
in the list of kernel threads, you were looking at the softirq kernel thread running on CPU 0.软中断系统可以想象为一系列内核线程(每个 CPU 一个),它们运行针对不同软中断事件注册的处理函数。如果你曾经在
top
命令中看到ksoftirqd/0
在内核线程列表中,那么你看到的就是在 CPU 0 上运行的软中断内核线程。Kernel subsystems (like networking) can register a softirq handler by executing the
open_softirq
function. We’ll see later how the networking system registers its softirq handlers. For now, let’s learn a bit more about how softirqs work.内核子系统(如网络子系统)可以通过执行
open_softirq
函数注册一个软中断处理程序。稍后我们将看到网络系统如何注册其软中断处理程序。现在,让我们进一步了解软中断的工作原理。ksoftirqd
Since softirqs are so important for deferring the work of device drivers, you might imagine that the
ksoftirqd
process is spawned pretty early in the life cycle of the kernel and you’d be correct.由于软中断对于推迟设备驱动程序的工作非常重要,你可能会认为
ksoftirqd
进程在内核生命周期的早期就会被创建,你想得没错。Looking at the code found in kernel/softirq.c reveals how the
ksoftirqd
system is initialized:查看
kernel/softirq.c
中的代码,可以了解ksoftirqd
系统是如何初始化的：
static struct smp_hotplug_thread softirq_threads = {
	.store              = &ksoftirqd,
	.thread_should_run  = ksoftirqd_should_run,
	.thread_fn          = run_ksoftirqd,
	.thread_comm        = "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
	register_cpu_notifier(&cpu_nfb);

	BUG_ON(smpboot_register_percpu_thread(&softirq_threads));

	return 0;
}
early_initcall(spawn_ksoftirqd);
As you can see from the
struct smp_hotplug_thread
definition above, there are two function pointers being registered: ksoftirqd_should_run
and run_ksoftirqd
.从上面的
struct smp_hotplug_thread
定义中可以看到,有两个函数指针被注册: ksoftirqd_should_run
和 run_ksoftirqd
。Both of these functions are called from kernel/smpboot.c as part of something which resembles an event loop.
这两个函数都在内核的
smpboot.c
中被调用,作为类似事件循环的一部分。The code in
kernel/smpboot.c
first calls ksoftirqd_should_run
which determines if there are any pending softirqs and, if there are pending softirqs, run_ksoftirqd
is executed. The run_ksoftirqd
does some minor bookkeeping before it calls __do_softirq
.smpboot.c
中的代码首先调用ksoftirqd_should_run
,它会确定是否有任何挂起的软中断,如果有,则执行run_ksoftirqd
。run_ksoftirqd
在调用__do_softirq
之前会进行一些小的簿记工作。__do_softirq
The
__do_softirq
function does a few interesting things:- determines which softirq is pending
- softirq time is accounted for statistics purposes
- softirq execution statistics are incremented
- the softirq handler for the pending softirq (which was registered with a call to
open_softirq
) is executed.
_do_softirq
函数执行了一些有趣的操作:- 确定哪个软中断处于挂起状态。
- 统计软中断时间,用于统计目的。
- 增加软中断执行统计信息。
- 执行针对挂起软中断注册的软中断处理程序(通过调用
open_softirq
注册)。
So, when you look at graphs of CPU usage and see
softirq
or si
you now know that this is measuring the amount of CPU usage happening in a deferred work context.所以,当你查看 CPU 使用情况图表并看到
softirq
或si
时,现在你知道这是在测量在推迟工作上下文中发生的 CPU 使用量。Monitoring(监控)
/proc/softirqs
The
softirq
system increments statistic counters which can be read from /proc/softirqs
Monitoring these statistics can give you a sense for the rate at which softirqs for various events are being generated.软中断系统会增加统计计数器,可以从
/proc/softirqs
读取这些计数器。监控这些统计信息可以让你了解各种事件的软中断生成速率。Check softIRQ stats by reading
/proc/softirqs
.通过读取
/proc/softirqs
检查软中断统计信息：
$ cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3
          HI:          0          0          0          0
       TIMER: 2831512516 1337085411 1103326083 1423923272
      NET_TX:   15774435     779806     733217     749512
      NET_RX: 1671622615 1257853535 2088429526 2674732223
       BLOCK: 1800253852    1466177    1791366     634534
BLOCK_IOPOLL:          0          0          0          0
     TASKLET:         25          0          0          0
       SCHED: 2642378225 1711756029  629040543  682215771
     HRTIMER:    2547911    2046898    1558136    1521176
         RCU: 2056528783 4231862865 3545088730  844379888
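These counters only ever increase, so it is the rate of change rather than the absolute values that is interesting; one quick way to eyeball that rate is to highlight the differences between refreshes:
$ watch -n1 -d cat /proc/softirqs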
This file can give you an idea of how your network receive (
NET_RX
) processing is currently distributed across your CPUs. If it is distributed unevenly, you will see a larger count value for some CPUs than others. This is one indicator that you might be able to benefit from Receive Packet Steering / Receive Flow Steering described below. Be careful using just this file when monitoring your performance: during periods of high network activity you would expect to see the rate NET_RX
increments increase, but this isn’t necessarily the case. It turns out that this is a bit nuanced, because there are additional tuning knobs in the network stack that can affect the rate at which NET_RX
softirqs will fire, which we’ll see soon.这个文件可以让你了解网络接收(
NET_RX
)处理当前在 CPU 之间的分布情况。如果分布不均匀,你会看到某些 CPU 的计数值比其他 CPU 大。这是一个指标,表明你可能会从下面描述的接收数据包导向 / 接收流导向中受益。在监控性能时,仅使用这个文件要小心:在网络活动高峰期,你可能期望看到NET_RX
的增量速率增加,但实际情况并非一定如此。事实证明,这有点微妙,因为网络栈中有其他调整旋钮会影响NET_RX
软中断的触发速率,我们很快就会看到。You should be aware of this, however, so that if you adjust the other tuning knobs you will know to examine
/proc/softirqs
and expect to see a change.不过,你应该意识到这一点,这样在调整其他调整旋钮时,你就会知道检查
/proc/softirqs
,并期望看到变化。Now, let’s move on to the networking stack and trace how network data is received from top to bottom.
现在,让我们进入网络栈,跟踪网络数据从顶层到底层的接收过程。
Linux network device subsystem(Linux 网络设备子系统)
Now that we’ve taken a look in to how network drivers and softirqs work, let’s see how the Linux network device subsystem is initialized. Then, we can follow the path of a packet starting with its arrival.
现在我们已经了解了网络驱动程序和软中断的工作原理,让我们看看 Linux 网络设备子系统是如何初始化的。然后,我们可以跟踪数据包从到达开始的路径。
Initialization of network device subsystem(网络设备子系统的初始化)
The network device (netdev) subsystem is initialized in the function
net_dev_init
. Lots of interesting things happen in this initialization function.网络设备(netdev)子系统在
net_dev_init
函数中初始化。在这个初始化函数中发生了很多有趣的事情。Initialization of struct softnet_data
structures(struct softnet_data 结构的初始化)
net_dev_init
creates a set of struct softnet_data
structures for each CPU on the system. These structures will hold pointers to several important things for processing network data:net_dev_init
为系统中的每个 CPU 创建一组struct softnet_data
结构。这些结构将保存处理网络数据所需的几个重要指针:- List for NAPI structures to be registered to this CPU.
- A backlog for data processing.
- The processing
weight
.
- The receive offload structure list.
- Receive packet steering settings.
- And more.
- 要注册到这个 CPU 的 NAPI 结构列表。
- 数据处理的积压队列。
- 处理
weight
。
- 接收卸载结构列表。
- 接收数据包导向设置。
- 还有更多。
Each of these will be examined in greater detail later as we progress up the stack.
随着我们在栈中向上推进,后面将更详细地研究这些内容。
Initialization of softirq handlers(软中断处理程序的初始化)
net_dev_init
registers a transmit and receive softirq handler which will be used to process incoming or outgoing network data. The code for this is pretty straight forward:net_dev_init
注册一个传输和接收软中断处理程序,用于处理传入或传出的网络数据。相关代码非常直接:

static int __init net_dev_init(void)
{
        /* ... */

        open_softirq(NET_TX_SOFTIRQ, net_tx_action);
        open_softirq(NET_RX_SOFTIRQ, net_rx_action);

        /* ... */
}
We’ll see soon how the driver’s interrupt handler will “raise” (or trigger) the
net_rx_action
function registered to the NET_RX_SOFTIRQ
softirq.我们很快就会看到驱动程序的中断处理程序如何 “触发”(或调用)注册到
NET_RX_SOFTIRQ
软中断的net_rx_action
函数。Data arrives(数据到达)
At long last; network data arrives!
终于,网络数据到达了!
Assuming that the RX queue has enough available descriptors, the packet is written to RAM via DMA. The device then raises the interrupt that is assigned to it (or in the case of MSI-X, the interrupt tied to the rx queue the packet arrived on).
假设 RX 队列有足够的可用描述符,数据包将通过 DMA 写入内存。然后设备会发出分配给它的中断(在 MSI-X 的情况下,是与数据包到达的 rx 队列相关联的中断)。
Interrupt handler(中断处理程序)
In general, the interrupt handler which runs when an interrupt is raised should try to defer as much processing as possible to happen outside the interrupt context. This is crucial because while an interrupt is being processed, other interrupts may be blocked.
一般来说,当一个中断被触发时运行的中断处理程序应该尽量将尽可能多的处理工作推迟到中断上下文之外进行。这一点至关重要,因为在处理一个中断时,其他中断可能会被阻塞。
Let’s take a look at the source for the MSI-X interrupt handler; it will really help illustrate the idea that the interrupt handler does as little work as possible.
让我们看一下 MSI-X 中断处理程序的源代码,这将真正有助于说明中断处理程序尽量少做工作的理念。在
drivers/net/ethernet/intel/igb/igb_main.c
中:

static irqreturn_t igb_msix_ring(int irq, void *data)
{
        struct igb_q_vector *q_vector = data;

        /* Write the ITR value calculated from the previous interrupt. */
        igb_write_itr(q_vector);

        napi_schedule(&q_vector->napi);

        return IRQ_HANDLED;
}
This interrupt handler is very short and performs 2 very quick operations before returning.
这个中断处理程序非常简短,在返回之前执行了两个非常快速的操作。
First, this function calls
igb_write_itr
which simply updates a hardware specific register. In this case, the register that is updated is one which is used to track the rate hardware interrupts are arriving.首先,这个函数调用
igb_write_itr
,它只是更新一个特定于硬件的寄存器。在这种情况下,更新的寄存器用于跟踪硬件中断的到达速率。This register is used in conjunction with a hardware feature called “Interrupt Throttling” (also called “Interrupt Coalescing”) which can be used to to pace the delivery of interrupts to the CPU. We’ll see soon how
ethtool
provides a mechanism for adjusting the rate at which IRQs fire.这个寄存器与一种称为 “中断节流”(也称为 “中断合并”)的硬件功能结合使用,可用于控制中断向 CPU 的传递速率。我们很快就会看到
ethtool
如何提供一种调整 IRQ 触发速率的机制。Secondly,
napi_schedule
is called which wakes up the NAPI processing loop if it was not already active. Note that the NAPI processing loop executes in a softirq; the NAPI processing loop does not execute from the interrupt handler. The interrupt handler simply causes it to start executing if it was not already.其次,调用
napi_schedule
,如果 NAPI 处理循环尚未激活,它将唤醒该循环。请注意,NAPI 处理循环在软中断中执行,而不是在中断处理程序中执行。中断处理程序只是在 NAPI 处理循环未运行时使其开始执行。The actual code showing exactly how this works is important; it will guide our understanding of how network data is processed on multi-CPU systems.
实际展示这一过程的代码很重要,它将指导我们理解在多 CPU 系统中网络数据是如何处理的。
NAPI and napi_schedule
Let’s figure out how the
napi_schedule
call from the hardware interrupt handler works.让我们弄清楚硬件中断处理程序中的
napi_schedule
调用是如何工作的。Remember, NAPI exists specifically to harvest network data without needing interrupts from the NIC to signal that data is ready for processing. As mentioned earlier, the NAPI
poll
loop is bootstrapped by receiving a hardware interrupt. In other words: NAPI is enabled, but off, until the first packet arrives at which point the NIC raises an IRQ and NAPI is started. There are a few other cases, as we’ll see soon, where NAPI can be disabled and will need a hardware interrupt to be raised before it will be started again.请记住,NAPI 的存在是为了在不需要 NIC 发出中断信号来表明数据已准备好处理的情况下收集网络数据。如前所述,NAPI 的
poll
循环是由接收硬件中断启动的。换句话说,NAPI 已启用但处于关闭状态,直到第一个数据包到达,此时 NIC 发出 IRQ,NAPI 才会启动。还有其他一些情况,我们很快就会看到,NAPI 可能会被禁用,并且需要硬件中断才能再次启动。The NAPI poll loop is started when the interrupt handler in the driver calls
napi_schedule
. napi_schedule
is actually just a wrapper function defined in a header file which calls down to __napi_schedule
.当驱动程序中的中断处理程序调用
napi_schedule
时,NAPI 轮询循环就会启动。napi_schedule
实际上只是一个在头文件中定义的包装函数,它会调用__napi_schedule
。在net/core/dev.c
中:From net/core/dev.c:
/**
 * __napi_schedule - schedule for receive
 * @n: entry to schedule
 *
 * The entry's receive function will be scheduled to run
 */
void __napi_schedule(struct napi_struct *n)
{
        unsigned long flags;

        local_irq_save(flags);
        ____napi_schedule(&__get_cpu_var(softnet_data), n);
        local_irq_restore(flags);
}
EXPORT_SYMBOL(__napi_schedule);
This code is using
__get_cpu_var
to get the softnet_data
structure that is registered to the current CPU. This softnet_data
structure and the struct napi_struct
structure handed up from the driver are passed into ____napi_schedule
. Wow, that’s a lot of underscores ;)这段代码使用
__get_cpu_var
获取注册到当前 CPU 的softnet_data
结构。这个softnet_data
结构和从驱动程序传递上来的struct napi_struct
结构被传递给____napi_schedule
。哇,好多下划线呢!Let’s take a look at
____napi_schedule
, from net/core/dev.c:让我们看一下
____napi_schedule
,在net/core/dev.c
中:

/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
                                     struct napi_struct *napi)
{
        list_add_tail(&napi->poll_list, &sd->poll_list);
        __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}
This code does two important things:
这段代码做了两件重要的事情:
- The
struct napi_struct
handed up from the device driver’s interrupt handler code is added to thepoll_list
attached to thesoftnet_data
structure associated with the current CPU.
__raise_softirq_irqoff
is used to “raise” (or trigger) a NET_RX_SOFTIRQ softirq. This will cause thenet_rx_action
registered during the network device subsystem initialization to be executed, if it’s not currently being executed.
- 从设备驱动程序的中断处理程序代码传递上来的
struct napi_struct
被添加到与当前 CPU 相关联的softnet_data
结构的poll_list
中。
- 使用
__raise_softirq_irqoff
“触发”(或调用)一个NET_RX_SOFTIRQ
软中断。这将导致在网络设备子系统初始化期间注册的net_rx_action
被执行(前提是它当前没有正在执行)。
As we’ll see shortly, the softirq handler function
net_rx_action
will call the NAPI poll
function to harvest packets.正如我们稍后将看到的,软中断处理函数
net_rx_action
将调用 NAPI 的poll
函数来收集数据包。A note about CPU and network data processing(关于 CPU 和网络数据处理的说明)
Note that all the code we’ve seen so far to defer work from a hardware interrupt handler to a softirq has been using structures associated with the current CPU.
请注意,到目前为止我们看到的将工作从硬件中断处理程序推迟到软中断的所有代码,都使用了与当前 CPU 相关联的结构。
While the driver’s IRQ handler itself does very little work, the softirq handler will execute on the same CPU as the driver’s IRQ handler.
虽然驱动程序的 IRQ 处理程序本身做的工作很少,但软中断处理程序将在与驱动程序的 IRQ 处理程序相同的 CPU 上执行。
This is why setting the CPU that a particular IRQ will be handled by is important: that CPU will be used not only to execute the interrupt handler in the driver, but the same CPU will also be used when harvesting packets in a softirq via NAPI.
这就是为什么设置特定 IRQ 将由哪个 CPU 处理很重要的原因:这个 CPU 不仅将用于执行驱动程序中的中断处理程序,还将用于通过 NAPI 在软中断中收集数据包。
As we’ll see later, things like Receive Packet Steering can distribute some of this work to other CPUs further up the network stack.
正如我们稍后将看到的,诸如接收数据包导向(Receive Packet Steering)之类的技术可以将部分工作分配到网络栈中更上层的其他 CPU 上。
Monitoring network data arrival(监控网络数据到达)
Hardware interrupt requests
硬件中断请求
Note: monitoring hardware IRQs does not give a complete picture of packet processing health. Many drivers turn off hardware IRQs while NAPI is running, as we'll see later. It is one important part of your whole monitoring solution.
注意:监控硬件 IRQ 并不能完全反映数据包处理的健康状况。正如我们稍后将看到的,许多驱动程序在 NAPI 运行时会关闭硬件 IRQ。它只是整个监控解决方案的一个重要部分。
Check hardware interrupt stats by reading
/proc/interrupts
.通过读取
/proc/interrupts
检查硬件中断统计信息:

$ cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
   0:         46          0          0          0 IR-IO-APIC-edge      timer
   1:          3          0          0          0 IR-IO-APIC-edge      i8042
  30: 3361234770          0          0          0 IR-IO-APIC-fasteoi   aacraid
  64:          0          0          0          0 DMAR_MSI-edge        dmar0
  65:          1          0          0          0 IR-PCI-MSI-edge      eth0
  66:  863649703          0          0          0 IR-PCI-MSI-edge      eth0-TxRx-0
  67:  986285573          0          0          0 IR-PCI-MSI-edge      eth0-TxRx-1
  68:         45          0          0          0 IR-PCI-MSI-edge      eth0-TxRx-2
  69:        394          0          0          0 IR-PCI-MSI-edge      eth0-TxRx-3
 NMI:    9729927    4008190    3068645    3375402 Non-maskable interrupts
 LOC: 2913290785 1585321306 1495872829 1803524526 Local timer interrupts
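Rather than reading a single snapshot, you can watch these counters move; a quick sketch (it assumes your NIC's IRQ names include the interface name, like the eth0-TxRx-N rows above):

$ watch -d -n 1 'grep eth0 /proc/interrupts'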
You can monitor the statistics in
/proc/interrupts
to see how the number and rate of hardware interrupts change as packets arrive and to ensure that each RX queue for your NIC is being handled by an appropriate CPU. As we’ll see shortly, this number only tells us how many hardware interrupts have happened, but it is not necessarily a good metric for understanding how much data has been received or processed as many drivers will disable NIC IRQs as part of their contract with the NAPI subsystem. Further, using interrupt coalescing will also affect the statistics gathered from this file. Monitoring this file can help you determine if the interrupt coalescing settings you select are actually working.你可以监控
/proc/interrupts
中的统计信息,以查看随着数据包的到达,硬件中断的数量和速率如何变化,并确保 NIC 的每个 RX 队列都由适当的 CPU 处理。正如我们稍后将看到的,这个数字仅告诉我们发生了多少硬件中断,但它不一定是了解接收或处理了多少数据的好指标,因为许多驱动程序会作为与 NAPI 子系统的约定的一部分禁用 NIC IRQ。此外,使用中断合并也会影响从这个文件中收集的统计信息。监控这个文件可以帮助你确定选择的中断合并设置是否真正有效。To get a more complete picture of your network processing health, you’ll need to monitor
/proc/softirqs
(as mentioned above) and additional files in /proc
that we’ll cover below.为了更全面地了解网络处理的健康状况,你需要监控
/proc/softirqs
(如上文所述)以及下面我们将介绍的/proc
中的其他文件。Tuning network data arrival(调整网络数据到达)
Interrupt coalescing(中断合并)
Interrupt coalescing is a method of preventing interrupts from being raised by a device to a CPU until a specific amount of work or number of events are pending.
中断合并是一种防止设备向 CPU 发出中断的方法,直到有特定数量的工作或事件等待处理。
This can help prevent interrupt storms and can help increase throughput or decrease latency, depending on the settings used. Fewer interrupts generated result in higher throughput, increased latency, and lower CPU usage. More interrupts generated result in the opposite: lower latency, lower throughput, but also increased CPU usage.
这有助于防止中断风暴,并且根据所使用的设置,还可以帮助提高吞吐量或降低延迟。生成的中断越少,吞吐量越高,延迟增加,CPU 使用率越低;生成的中断越多,情况则相反:延迟降低,吞吐量降低,但 CPU 使用率也会增加。
Historically, earlier versions of the
igb
, e1000
, and other drivers included support for a parameter called InterruptThrottleRate
. This parameter has been replaced in more recent drivers with a generic ethtool
function.从历史上看,早期版本的
igb
、e1000
和其他驱动程序包含一个名为InterruptThrottleRate
的参数支持。在更新的驱动程序中,这个参数已被一个通用的ethtool
函数取代。Get the current IRQ coalescing settings with
ethtool -c
.使用
ethtool -c
获取当前的 IRQ 合并设置:

$ sudo ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
...
ethtool
provides a generic interface for setting various coalescing settings. Keep in mind, however, that not every device or driver will support every setting. You should check your driver documentation or driver source code to determine what is, or is not, supported. As per the ethtool documentation: “Anything not implemented by the driver causes these values to be silently ignored.”ethtool
提供了一个通用接口来设置各种合并设置。然而,请记住,并非每个设备或驱动程序都支持所有设置。你应该查看驱动程序文档或驱动程序源代码,以确定哪些设置受支持,哪些不受支持。根据ethtool
文档:“任何未被驱动程序实现的设置都会被静默忽略。”One interesting option that some drivers support is “adaptive RX/TX IRQ coalescing.” This option is typically implemented in hardware. The driver usually needs to do some work to inform the NIC that this feature is enabled and some bookkeeping as well (as seen in the
igb
driver code above).一些驱动程序支持的一个有趣选项是 “自适应 RX/TX IRQ 合并”。这个选项通常在硬件中实现。驱动程序通常需要做一些工作来通知 NIC 启用此功能,并进行一些簿记工作(如上面
igb
驱动程序代码中所示)。The result of enabling adaptive RX/TX IRQ coalescing is that interrupt delivery will be adjusted to improve latency when packet rate is low and also improve throughput when packet rate is high.
启用自适应 RX/TX IRQ 合并的结果是,在数据包速率较低时,中断传递将被调整以改善延迟;在数据包速率较高时,则会提高吞吐量。
Enable adaptive RX IRQ coalescing with
ethtool -C
使用
ethtool -C
启用自适应 RX IRQ 合并:$ sudo ethtool -C eth0 adaptive-rx on
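You can read the setting back afterward to confirm that the driver accepted it. The output below is illustrative; the exact fields vary by driver:

$ sudo ethtool -c eth0 | grep -i adaptive
Adaptive RX: on  TX: off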
You can also use
ethtool -C
to set several options. Some of the more common options to set are:你还可以使用
ethtool -C
设置多个选项。一些比较常见的可设置选项包括:rx-usecs
: How many usecs to delay an RX interrupt after a packet arrives.
rx-frames
: Maximum number of data frames to receive before an RX interrupt.
rx-usecs-irq
: How many usecs to delay an RX interrupt while an interrupt is being serviced by the host.
rx-frames-irq
: Maximum number of data frames to receive before an RX interrupt is generated while the system is servicing an interrupt.
And many, many more.
rx-usecs
:数据包到达后,延迟 RX 中断的微秒数。
rx-frames
:在产生 RX 中断之前,接收的数据帧的最大数量。
rx-usecs-irq
:在主机处理中断时,延迟 RX 中断的微秒数。
rx-frames-irq
:在系统处理中断时,产生 RX 中断之前,接收的数据帧的最大数量。
还有很多其他选项。
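Purely as an illustration (not a recommendation for these particular values), several of the options above can be combined in a single ethtool -C call, assuming your driver supports them. Here RX interrupts would be delayed until roughly 100 microseconds or 64 frames have accumulated, whichever happens first:

$ sudo ethtool -C eth0 adaptive-rx off rx-usecs 100 rx-frames 64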
Reminder that your hardware and driver may only support a subset of the options listed above. You should consult your driver source code and your hardware data sheet for more information on supported coalescing options.
请记住,你的硬件和驱动程序可能仅支持上述选项的一个子集。你应该查阅驱动程序源代码和硬件数据表,以获取有关支持的合并选项的更多信息。
Unfortunately, the options you can set aren’t well documented anywhere except in a header file. Check the source of include/uapi/linux/ethtool.h to find an explanation of each option supported by
ethtool
(but not necessarily your driver and NIC).不幸的是,除了在一个头文件中,你可以设置的选项并没有很好的文档说明。查看
include/uapi/linux/ethtool.h
的源代码,以找到ethtool
支持的每个选项的解释(但不一定适用于你的驱动程序和 NIC)。Note: while interrupt coalescing seems to be a very useful optimization at first glance, the rest of the networking stack internals also come into the fold when attempting to optimize. Interrupt coalescing can be useful in some cases, but you should ensure that the rest of your networking stack is also tuned properly. Simply modifying your coalescing settings alone will likely provide minimal benefit in and of itself.
注意:虽然乍一看中断合并似乎是一种非常有用的优化,但在尝试优化时,网络栈的其他内部机制也会起作用。中断合并在某些情况下可能有用,但你应该确保网络栈的其他部分也进行了适当的调整。仅仅修改合并设置本身可能只会带来最小的好处。
Adjusting IRQ affinities(调整 IRQ 亲和力)
If your NIC supports RSS / multiqueue or if you are attempting to optimize for data locality, you may wish to use a specific set of CPUs for handling interrupts generated by your NIC.
如果你的 NIC 支持 RSS / 多队列功能,或者你试图优化数据局部性,你可能希望使用一组特定的 CPU 来处理 NIC 生成的中断。
Setting specific CPUs allows you to segment which CPUs will be used for processing which IRQs. These changes may affect how upper layers operate, as we’ve seen for the networking stack.
设置特定的 CPU 可以让你划分哪些 CPU 将用于处理哪些 IRQ。这些更改可能会影响上层的操作,就像我们在网络栈中看到的那样。
If you do decide to adjust your IRQ affinities, you should first check if you are running the
irqbalance
daemon. This daemon tries to automatically balance IRQs to CPUs and it may overwrite your settings. If you are running irqbalance
, you should either disable irqbalance
or use the --banirq
in conjunction with IRQBALANCE_BANNED_CPUS
to let irqbalance
know that it shouldn’t touch a set of IRQs and CPUs that you want to assign yourself.如果你决定调整 IRQ 亲和力,首先应该检查是否正在运行
irqbalance
守护进程。这个守护进程试图自动将 IRQ 平衡到各个 CPU 上,它可能会覆盖你的设置。如果你正在运行irqbalance
,你应该要么禁用它,要么使用--banirq
结合IRQBALANCE_BANNED_CPUS
,让irqbalance
知道不要触碰你想要自己分配的一组 IRQ 和 CPU。Next, you should check the file
/proc/interrupts
for a list of the IRQ numbers for each network RX queue for your NIC.接下来,你应该查看
/proc/interrupts
文件,获取 NIC 每个网络 RX 队列的 IRQ 编号列表。Finally, you can adjust the which CPUs each of those IRQs will be handled by modifying
/proc/irq/IRQ_NUMBER/smp_affinity
for each IRQ number.最后,你可以通过修改每个 IRQ 编号对应的
/proc/irq/IRQ_NUMBER/smp_affinity
文件,来调整每个 IRQ 将由哪些 CPU 处理。You simply write a hexadecimal bitmask to this file to instruct the kernel which CPUs it should use for handling the IRQ.
你只需向这个文件写入一个十六进制位掩码,就可以指示内核应该使用哪些 CPU 来处理该 IRQ。
Example: Set the IRQ affinity for IRQ 8 to CPU 0
示例:将 IRQ 8 的 IRQ 亲和力设置为 CPU 0
$ sudo bash -c 'echo 1 > /proc/irq/8/smp_affinity'
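If your NIC exposes one IRQ per RX queue, you may prefer to spread the queues across CPUs rather than stacking them all on CPU 0. The following is only a sketch: the eth0-TxRx naming and the queue-N-to-CPU-N mapping are assumptions you will need to adapt to your own hardware.

$ grep 'eth0-TxRx' /proc/interrupts | while read -r line; do
      irq=$(echo "$line" | cut -d: -f1 | tr -d ' ')
      name=$(echo "$line" | awk '{print $NF}')   # e.g. eth0-TxRx-0
      queue=${name##*-}
      mask=$(printf '%x' $((1 << queue)))        # queue N -> CPU N
      echo "$mask" | sudo tee "/proc/irq/$irq/smp_affinity" > /dev/null
  done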
Network data processing begins(网络数据处理开始)
Once the softirq code determines that a softirq is pending, begins processing, and executes
net_rx_action
, network data processing begins.一旦软中断代码确定有一个软中断处于挂起状态,开始处理并执行
net_rx_action
,网络数据处理就开始了。Let’s take a look at portions of the
net_rx_action
processing loop to understand how it works, which pieces are tunable, and what can be monitored.让我们看一下
net_rx_action
处理循环的部分内容,以了解它是如何工作的、哪些部分是可调整的,以及可以监控哪些内容。net_rx_action
processing loop(net_rx_action 处理循环)
net_rx_action
begins the processing of packets from the memory the packets were DMA’d into by the device.net_rx_action
开始处理设备通过 DMA 传输到内存中的数据包。The function iterates through the list of NAPI structures that are queued for the current CPU, dequeuing each structure, and operating on it.
该函数遍历为当前 CPU 排队的 NAPI 结构列表,将每个结构出队并进行处理。
The processing loop bounds the amount of work and execution time that can be consumed by the registered NAPI
poll
functions. It does this in two ways:处理循环限制了注册的 NAPI
poll
函数可以消耗的工作量和执行时间,它通过两种方式实现这一点:- By keeping track of a work
budget
(which can be adjusted), and
- Checking the elapsed time
- 跟踪一个工作
budget
(可以调整)。
- 检查经过的时间。在
net/core/dev.c
中:
From net/core/dev.c:
while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;
        int work, weight;

        /* If softirq window is exhausted then punt.
         * Allow this to run for 2 jiffies since which will allow
         * an average latency of 1.5/HZ.
         */
        if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
                goto softnet_break;
This is how the kernel prevents packet processing from consuming the entire CPU. The
budget
above is the total available budget that will be spent among each of the available NAPI structures registered to this CPU.这就是内核防止数据包处理占用整个 CPU 的方式。上面的
budget
是分配给当前 CPU 上每个可用 NAPI 结构的总可用预算。This is another reason why multiqueue NICs should have the IRQ affinity carefully tuned. Recall that the CPU which handles the IRQ from the device will be the CPU where the softirq handler will execute and, as a result, will also be the CPU where the above loop and budget computation runs.
这也是为什么具有多队列的 NIC 应该仔细调整 IRQ 亲和力的另一个原因。回想一下,处理设备 IRQ 的 CPU 将是软中断处理程序执行的 CPU,因此,也是上述循环和预算计算运行的 CPU。
Systems with multiple NICs each with multiple queues can end up in a situation where multiple NAPI structs are registered to the same CPU. Data processing for all NAPI structs on the same CPU spends from the same
budget
.具有多个 NIC 且每个 NIC 都有多个队列的系统可能会出现多个 NAPI 结构注册到同一个 CPU 的情况。同一 CPU 上所有 NAPI 结构的数据处理都从相同的
budget
中消耗资源。If you don’t have enough CPUs to distribute your NIC’s IRQs, you can consider increasing the
net_rx_action
budget
to allow for more packet processing for each CPU. Increasing the budget will increase CPU usage (specifically sitime
or si
in top
or other programs), but should reduce latency as data will be processed more promptly.如果你没有足够的 CPU 来分配 NIC 的 IRQ,你可以考虑增加
net_rx_action
的budget
,以便每个 CPU 可以处理更多的数据包。增加预算会增加 CPU 使用率(特别是在top
或其他程序中的si
时间或si
字段),但应该会减少延迟,因为数据将被更及时地处理。Note: the CPU will still be bounded by a time limit of 2 jiffies, regardless of the assigned budget.
注意:无论分配的预算是多少,CPU 仍然会受到 2 个 jiffies 的时间限制。
NAPI poll
function and weight
(NAPI poll 函数和 weight)
Recall that network device drivers use
netif_napi_add
for registering poll
function. As we saw earlier in this post, the igb
driver has a piece of code like this:回想一下,网络设备驱动程序使用
netif_napi_add
注册poll
函数。正如我们在本文前面看到的,igb
驱动程序有类似这样的代码:/* initialize NAPI */ netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);
This registers a NAPI structure with a hardcoded weight of 64. We’ll see now how this is used in the
net_rx_action
processing loop.这会注册一个 NAPI 结构,其权重被硬编码为 64。我们现在将看看这个权重在
net_rx_action
处理循环中是如何使用的。在net/core/dev.c
中:From net/core/dev.c:
weight = n->weight;

work = 0;
if (test_bit(NAPI_STATE_SCHED, &n->state)) {
        work = n->poll(n, weight);
        trace_napi_poll(n);
}

WARN_ON_ONCE(work > weight);

budget -= work;
This code obtains the weight which was registered to the NAPI struct (
64
in the above driver code) and passes it into the poll
function which was also registered to the NAPI struct (igb_poll
in the above code).这段代码获取注册到 NAPI 结构的权重(在上面的驱动程序代码中为 64),并将其传递给也注册到该 NAPI 结构的
poll
函数(在上面的代码中为igb_poll
)。The
poll
function returns the number of data frames that were processed. This amount is saved above as work
, which is then subtracted from the overall budget
.poll
函数返回处理的数据帧数量,这个数量在上面保存为work
,然后从总budget
中减去。So, assuming:
- You are using a weight of
64
from your driver (all drivers were hardcoded with this value in Linux 3.13.0), and
- You have your
budget
set to the default of300
Your system would stop processing data when either:
- The
igb_poll
function was called at most 5 times (less if no data to process as we’ll see next), OR
- At least 2 jiffies of time have elapsed.
所以,假设:
- 你使用的驱动程序中的权重为 64(在 Linux 3.13.0 中,所有驱动程序都将此值硬编码为此)。
- 你的
budget
设置为默认的 300。
你的系统将在以下两种情况之一停止处理数据:
igb_poll
函数最多被调用 5 次(如果没有数据要处理,调用次数会更少,我们接下来会看到)。
- 至少经过了 2 个 jiffies 的时间。
The NAPI / network device driver contract(NAPI / 网络设备驱动程序约定)
One important piece of information about the contract between the NAPI subsystem and device drivers which has not been mentioned yet are the requirements around shutting down NAPI.
关于 NAPI 子系统和设备驱动程序之间的约定,有一个重要信息之前尚未提及,那就是关于关闭 NAPI 的要求。
This part of the contract is as follows:
这个约定的这部分内容如下:
- If a driver’s
poll
function consumes its entire weight (which is hardcoded to64
) it must NOT modify NAPI state. Thenet_rx_action
loop will take over.
- If a driver’s
poll
function does NOT consume its entire weight, it must disable NAPI. NAPI will be re-enabled next time an IRQ is received and the driver’s IRQ handler callsnapi_schedule
.
- 如果驱动程序的
poll
函数消耗了其全部权重(硬编码为 64),它不能修改 NAPI 状态,net_rx_action
循环将接管。
- 如果驱动程序的
poll
函数没有消耗其全部权重,它必须禁用 NAPI。下次接收到 IRQ 并且驱动程序的 IRQ 处理程序调用napi_schedule
时,NAPI 将重新启用。
We’ll see how
net_rx_action
deals with the first part of that contract now. Next, the poll
function is examined, we’ll see how the second part of that contract is handled.我们现在将看看
net_rx_action
如何处理该约定的第一部分。接下来,在检查poll
函数时,我们将看到该约定的第二部分是如何处理的。Finishing the net_rx_action
loop(完成 net_rx_action 循环)
The
net_rx_action
processing loop finishes up with one last section of code that deals with the first part of the NAPI contract explained in the previous section. From net/core/dev.c:net_rx_action
处理循环以最后一段代码结束,这段代码处理上一节中解释的 NAPI 约定的第一部分。在net/core/dev.c
中:

/* Drivers must not modify the NAPI state if they
 * consume the entire weight.  In such cases this code
 * still "owns" the NAPI instance and therefore can
 * move the instance around on the list at-will.
 */
if (unlikely(work == weight)) {
        if (unlikely(napi_disable_pending(n))) {
                local_irq_enable();
                napi_complete(n);
                local_irq_disable();
        } else {
                if (n->gro_list) {
                        /* flush too old packets
                         * If HZ < 1000, flush all packets.
                         */
                        local_irq_enable();
                        napi_gro_flush(n, HZ >= 1000);
                        local_irq_disable();
                }
                list_move_tail(&n->poll_list, &sd->poll_list);
        }
}
If the entire work is consumed, there are two cases that
net_rx_action
handles:如果工作全部完成,
net_rx_action
会处理两种情况:- The network device should be shutdown (e.g. because the user ran
ifconfig eth0 down
), - 网络设备应该关闭(例如,因为用户运行了
ifconfig eth0 down
命令)。
- If the device is not being shutdown, check if there’s a generic receive offload (GRO) list. If the timer tick rate is >= 1000, all GRO’d network flows that were recently updated will be flushed. We’ll dig into GRO in detail later. Move the NAPI structure to the end of the list for this CPU so the next iteration of the loop will get the next NAPI structure registered.
- 如果设备没有关闭,检查是否有通用接收卸载(GRO)列表。如果定时器滴答率
>= 1000
,所有最近更新的GRO网络流将被刷新。我们稍后会深入研究GRO。将NAPI结构移动到该CPU列表的末尾,以便循环的下一次迭代可以获取注册的下一个NAPI结构。
And that is how the packet processing loop invokes the driver’s registered
poll
function to process packets. As we’ll see shortly, the poll
function will harvest network data and send it up the stack to be processed.这就是数据包处理循环调用驱动程序注册的
poll
函数来处理数据包的方式。正如我们稍后将看到的,poll
函数将收集网络数据并将其发送到栈中进行进一步处理。Exiting the loop when limits are reached(达到限制时退出循环)
The
net_rx_action
loop will exit when either:net_rx_action
循环将在以下情况之一退出:- The poll list registered for this CPU has no more NAPI structures (
!list_empty(&sd->poll_list)
), or - 为该CPU注册的轮询列表中没有更多NAPI结构(
!list_empty(&sd->poll_list)
)。
- The remaining budget is <= 0, or
- 剩余预算
<= 0
。
- The time limit of 2 jiffies has been reached
- 达到 2 个 jiffies 的时间限制。
Here’s this code we saw earlier again:
这是我们之前看到的代码:
/* If softirq window is exhausted then punt.
 * Allow this to run for 2 jiffies since which will allow
 * an average latency of 1.5/HZ.
 */
if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
        goto softnet_break;
If you follow the
softnet_break
label you stumble upon something interesting. From net/core/dev.c:如果跟随
softnet_break
标签,你会发现一些有趣的事情。在net/core/dev.c
中:

softnet_break:
        sd->time_squeeze++;
        __raise_softirq_irqoff(NET_RX_SOFTIRQ);
        goto out;
The
struct softnet_data
structure has some statistics incremented and the softirq NET_RX_SOFTIRQ
is shut down. The time_squeeze
field is a measure of the number of times net_rx_action
had more work to do but either the budget was exhausted or the time limit was reached before it could be completed. This is a tremendously useful counter for understanding bottlenecks in network processing. We’ll see shortly how to monitor this value. The NET_RX_SOFTIRQ
is disabled to free up processing time for other tasks. This makes sense as this small stub of code is only executed when more work could have been done, but we don’t want to monopolize the CPU.struct softnet_data
结构的一些统计信息会增加,并且NET_RX_SOFTIRQ
软中断会被关闭。time_squeeze
字段用于衡量net_rx_action
有更多工作要做,但在完成之前预算耗尽或达到时间限制的次数。这是一个非常有用的计数器,用于了解网络处理中的瓶颈。我们稍后将看到如何监控这个值。NET_RX_SOFTIRQ
被禁用,以便为其他任务释放处理时间。这是有意义的,因为只有在还有更多工作要做,但我们又不想独占 CPU 的情况下,才会执行这个小代码段。Execution is then transferred to the
out
label. Execution can also make it to the out
label if there were no more NAPI structures to process, in other words, there is more budget than there is network activity and all the drivers have shut NAPI off and there is nothing left for net_rx_action
to do.然后执行转移到
out
标签。如果没有更多的 NAPI 结构要处理,即预算比网络活动多,并且所有驱动程序都已关闭 NAPI,net_rx_action
没有其他事情可做,执行也会到达out
标签。The
out
section does one important thing before returning from net_rx_action
: it calls net_rps_action_and_irq_enable
. This function serves an important purpose if Receive Packet Steering is enabled; it wakes up remote CPUs to start processing network data.out
部分在从net_rx_action
返回之前做了一件重要的事情:它调用net_rps_action_and_irq_enable
。如果启用了接收数据包导向(Receive Packet Steering),这个函数将唤醒远程 CPU 以开始处理网络数据。We’ll see more about how RPS works later. For now, let’s see how to monitor the health of the
net_rx_action
processing loop and move on to the inner working of NAPI poll
functions so we can progress up the network stack.稍后我们将更详细地了解RPS是如何工作的。现在,让我们先看看如何监控
net_rx_action
处理循环的健康状况,然后深入探讨 NAPI poll 函数的内部工作原理,以便继续向上推进网络协议栈。NAPI poll
Recall in previous sections that device drivers allocate a region of memory for the device to perform DMA to incoming packets. Just as it is the responsibility of the driver to allocate those regions, it is also the responsibility of the driver to unmap those regions, harvest the data, and send it up the network stack.
回想一下前面的章节,设备驱动程序为设备分配一块内存区域,以便设备对传入的数据包执行 DMA 操作。正如分配这些区域是驱动程序的责任一样,取消映射这些区域、收集数据并将其发送到网络栈也是驱动程序的责任。
Let’s take a look at how the
igb
driver does this to get an idea of how this works in practice.让我们看看
igb
驱动程序是如何做到这一点的,以便了解实际情况。igb_poll
At long last, we can finally examine our friend
igb_poll
. It turns out the code for igb_poll
is deceptively simple. Let’s take a look. From drivers/net/ethernet/intel/igb/igb_main.c:终于,我们可以研究一下
igb_poll
函数了。事实证明,igb_poll
的代码看似简单,实则不然。让我们来看看。在drivers/net/ethernet/intel/igb/igb_main.c
中:

/**
 * igb_poll - NAPI Rx polling callback, NAPI Rx轮询回调函数
 * @napi: napi polling structure, napi轮询结构
 * @budget: count of how many packets we should handle, 我们应该处理的数据包数量
 **/
static int igb_poll(struct napi_struct *napi, int budget)
{
        struct igb_q_vector *q_vector = container_of(napi,
                                                     struct igb_q_vector,
                                                     napi);
        bool clean_complete = true;

#ifdef CONFIG_IGB_DCA
        if (q_vector->adapter->flags & IGB_FLAG_DCA_ENABLED)
                igb_update_dca(q_vector);
#endif

        /* ... */

        if (q_vector->rx.ring)
                clean_complete &= igb_clean_rx_irq(q_vector, budget);

        /* If all work not completed, return budget and keep polling */
        if (!clean_complete)
                return budget;

        /* If not enough Rx work done, exit the polling mode */
        napi_complete(napi);
        igb_ring_irq_enable(q_vector);

        return 0;
}
This code does a few interesting things:
这段代码做了几件有趣的事情:
- If Direct Cache Access (DCA) support is enabled in the kernel, the CPU cache is warmed so that accesses to the RX ring will hit CPU cache. You can read more about DCA in the Extras section at the end of this blog post.
- Next,
igb_clean_rx_irq
is called which does the heavy lifting, as we’ll see next.
- Next,
clean_complete
is checked to determine if there was still more work that could have been done. If so, thebudget
(remember, this was hardcoded to64
) is returned. As we saw earlier,net_rx_action
will move this NAPI structure to the end of the poll list.
- Otherwise, the driver turns off NAPI by calling
napi_complete
and re-enables interrupts by callingigb_ring_irq_enable
. The next interrupt that arrives will re-enable NAPI.
- 如果内核中启用了直接缓存访问(DCA)支持,CPU 缓存将被预热,以便对 RX 环的访问能够命中 CPU 缓存。你可以在本文末尾的 “其他内容” 部分中阅读更多关于 DCA 的信息。
- 接下来,调用
igb_clean_rx_irq
函数,它将完成主要的工作,我们接下来会看到。
- 然后,检查
clean_complete
以确定是否还有更多工作可以做。如果是这样,返回budget
(请记住,在上面的代码中,它被硬编码为 64)。正如我们之前看到的,net_rx_action
将把这个 NAPI 结构移动到轮询列表的末尾。
- 否则,驱动程序通过调用
napi_complete
关闭 NAPI,并通过调用igb_ring_irq_enable
重新启用中断。下一个到达的中断将重新启用 NAPI。
Let’s see how
igb_clean_rx_irq
sends network data up the stack.让我们看看
igb_clean_rx_irq
是如何将网络数据发送到栈中的。igb_clean_rx_irq
The
igb_clean_rx_irq
function is a loop which processes one packet at a time until the budget
is reached or no additional data is left to process.igb_clean_rx_irq
函数是一个循环,它一次处理一个数据包,直到达到budget
限制或没有更多数据可处理。The loop in this function does a few important things:
这个函数中的循环做了几件重要的事情:
- Allocates additional buffers for receiving data as used buffers are cleaned out. Additional buffers are added
IGB_RX_BUFFER_WRITE
(16) at a time.
- Fetch a buffer from the RX queue and store it in an
skb
structure.
- Check if the buffer is an “End of Packet” buffer. If so, continue processing. Otherwise, continue fetching additional buffers from the RX queue, adding them to the
skb
. This is necessary if a received data frame is larger than the buffer size.
- Verify that the layout and headers of the data are correct.
- The number of bytes processed statistic counter is increased by
skb->len
.
- Set the hash, checksum, timestamp, VLAN id, and protocol fields of the skb. The hash, checksum, timestamp, and VLAN id are provided by the hardware. If the hardware is signaling a checksum error, the
csum_error
statistic is incremented. If the checksum succeeded and the data is UDP or TCP data, theskb
is marked asCHECKSUM_UNNECESSARY
. If the checksum failed, the protocol stacks are left to deal with this packet. The protocol is computed with a call toeth_type_trans
and stored in theskb
struct.
- The constructed
skb
is handed up the network stack with a call tonapi_gro_receive
.
- The number of packets processed statistics counter is incremented.
- The loop continues until the number of packets processed reaches the budget.
- 当清理已使用的缓冲区时,为接收数据分配额外的缓冲区。每次以
IGB_RX_BUFFER_WRITE
(16)为单位添加额外的缓冲区。
- 从 RX 队列中获取一个缓冲区,并将其存储在一个
skb
结构中。
- 检查该缓冲区是否为 “数据包结束” 缓冲区。如果是,则继续处理;否则,继续从 RX 队列中获取更多缓冲区,并将它们添加到
skb
中。如果接收到的数据帧大于缓冲区大小,这是必要的操作。
- 验证数据的布局和头部是否正确。
- 将处理的字节数统计计数器增加
skb->len
。
- 设置
skb
的哈希、校验和、时间戳、VLAN ID 和协议字段。哈希、校验和、时间戳和 VLAN ID 由硬件提供。如果硬件检测到校验和错误,csum_error
统计信息将增加。如果校验和成功,并且数据是 UDP 或 TCP 数据,则将skb
标记为CHECKSUM_UNNECESSARY
。如果校验和失败,协议栈将负责处理这个数据包。通过调用eth_type_trans
计算协议,并将其存储在skb
结构中。
- 通过调用
napi_gro_receive
将构造好的skb
传递到网络栈中。
- 增加处理的数据包数量统计计数器。
- 循环继续,直到处理的数据包数量达到预算。
Once the loop terminates, the function assigns statistics counters for rx packets and bytes processed.
一旦循环终止,该函数会为接收的数据包和处理的字节数分配统计计数器。
Now it’s time to take two detours prior to proceeding up the network stack. First, let’s see how to monitor and tune the network subsystem’s softirqs. Next, let’s talk about Generic Receive Offloading (GRO). After that, the rest of the networking stack will make more sense as we enter
napi_gro_receive
.现在,在继续向上研究网络栈之前,我们需要进行两个小插曲。首先,让我们看看如何监控和调整网络子系统的软中断。接下来,让我们讨论一下通用接收卸载(Generic Receive Offloading,GRO)。之后,当我们进入
napi_gro_receive
时,网络栈的其余部分会更容易理解。Monitoring network data processing(监控网络数据处理)
/proc/net/softnet_stat
As seen in the previous section,
net_rx_action
increments a statistic when exiting the net_rx_action
loop and when additional work could have been done, but either the budget
or the time limit for the softirq was hit. This statistic is tracked as part of the struct softnet_data
associated with the CPU.如前一节所述,当
net_rx_action
退出循环,并且还有更多工作可以做,但budget
或软中断的时间限制已达到时,它会增加一个统计信息。这个统计信息作为与 CPU 相关联的struct softnet_data
的一部分进行跟踪。These statistics are output to a file in proc:
/proc/net/softnet_stat
for which there is, unfortunately, very little documentation. The fields in the file in proc are not labeled and could change between kernel releases.这些统计信息输出到
proc
文件系统中的一个文件:/proc/net/softnet_stat
,遗憾的是,关于这个文件的文档很少。proc
文件中的字段没有标记,并且可能在不同的内核版本之间发生变化。In Linux 3.13.0, you can find which values map to which field in
/proc/net/softnet_stat
by reading the kernel source. From net/core/net-procfs.c:在 Linux 3.13.0 中,你可以通过阅读内核源代码来确定
/proc/net/softnet_stat
中哪些值对应哪些字段。在net/core/net-procfs.c
中:

seq_printf(seq,
           "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
           sd->processed, sd->dropped, sd->time_squeeze, 0,
           0, 0, 0, 0, /* was fastroute */
           sd->cpu_collision, sd->received_rps, flow_limit_count);
Many of these statistics have confusing names and are incremented in places where you might not expect. An explanation of when and where each of these is incremented will be provided as the network stack is examined. Since the
time_squeeze
statistic was seen in net_rx_action
, I thought it made sense to document this file now.这些统计信息中的许多名称都容易引起混淆,并且在你可能想不到的地方增加。在研究网络栈时,将提供每个统计信息何时何地增加的解释。由于在
net_rx_action
中看到了time_squeeze
统计信息,我认为现在记录这个文件是有意义的。Monitor network data processing statistics by reading
/proc/net/softnet_stat
.通过读取
/proc/net/softnet_stat
监控网络数据处理统计信息:

$ cat /proc/net/softnet_stat
6dcad223 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000
6f0e1565 00000000 00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000
660774ec 00000000 00000003 00000000 00000000 00000000 00000000 00000000 00000000 00000000
61c99331 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
6794b1b3 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000
6488cb92 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Important details about
/proc/net/softnet_stat
:关于
/proc/net/softnet_stat
的重要细节:- Each line of
/proc/net/softnet_stat
corresponds to astruct softnet_data
structure, of which there is 1 per CPU.
- The values are separated by a single space and are displayed in hexadecimal
- The first value,
sd->processed
, is the number of network frames processed. This can be more than the total number of network frames received if you are using ethernet bonding. There are cases where the ethernet bonding driver will trigger network data to be re-processed, which would increment thesd->processed
count more than once for the same packet.
- The second value,
sd->dropped
, is the number of network frames dropped because there was no room on the processing queue. More on this later.
- The third value,
sd->time_squeeze
, is (as we saw) the number of times thenet_rx_action
loop terminated because the budget was consumed or the time limit was reached, but more work could have been. Increasing thebudget
as explained earlier can help reduce this.
- The next 5 values are always 0.
- The ninth value,
sd->cpu_collision
, is a count of the number of times a collision occurred when trying to obtain a device lock when transmitting packets. This article is about receive, so this statistic will not be seen below.
- The tenth value,
sd->received_rps
, is a count of the number of times this CPU has been woken up to process packets via an Inter-processor Interrupt
- The last value,
flow_limit_count
, is a count of the number of times the flow limit has been reached. Flow limiting is an optional Receive Packet Steering feature that will be examined shortly.
/proc/net/softnet_stat
的每一行对应一个struct softnet_data
结构,每个 CPU 有一个这样的结构。
- 值之间用单个空格分隔,并以十六进制显示。
- 第一个值
sd->processed
是处理的网络帧数。如果你使用以太网绑定,这个值可能会大于接收的网络帧总数。在某些情况下,以太网绑定驱动程序会触发网络数据重新处理,这会使sd->processed
计数对同一个数据包增加多次。
- 第二个值
sd->dropped
是由于处理队列中没有空间而丢弃的网络帧数。稍后会详细介绍。
- 第三个值
sd->time_squeeze
(如我们所见)是net_rx_action
循环因预算耗尽或达到时间限制而终止,但还有更多工作可做的次数。如前所述,增加budget
可以帮助减少这个值。
- 接下来的 5 个值始终为 0。
- 第九个值
sd->cpu_collision
是在传输数据包时尝试获取设备锁时发生冲突的次数。本文是关于接收的,所以下面不会看到这个统计信息。
- 第十个值
sd->received_rps
是这个 CPU 通过处理器间中断被唤醒以处理数据包的次数。
- 最后一个值
flow_limit_count
是达到流量限制的次数。流量限制是接收数据包导向的一个可选功能,稍后将进行研究。
If you decide to monitor this file and graph the results, you must be extremely careful that the ordering of these fields hasn’t changed and that the meaning of each field has been preserved. You will need to read the kernel source to verify this.
如果你决定监控这个文件并绘制结果图表,必须非常小心,确保这些字段的顺序没有改变,并且每个字段的含义保持不变。你需要阅读内核源代码来验证这一点。
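If you want a quick, human-readable view of these counters while you investigate, a throwaway sketch like the one below can help. It assumes the Linux 3.13.0 field order described above and a gawk-compatible awk (for strtonum); verify both before trusting the numbers.

$ awk '{ printf "cpu=%d processed=%d dropped=%d time_squeeze=%d received_rps=%d\n",
         NR-1, strtonum("0x" $1), strtonum("0x" $2), strtonum("0x" $3), strtonum("0x" $10) }' /proc/net/softnet_stat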
Tuning network data processing(调整网络数据处理)
Adjusting the
net_rx_action
budget(调整 net_rx_action 预算)You can adjust the
net_rx_action
budget, which determines how much packet processing can be spent among all NAPI structures registered to a CPU by setting a sysctl value named net.core.netdev_budget
.你可以通过设置一个名为
net.core.netdev_budget
的 sysctl 值来调整net_rx_action
预算,该预算决定了在注册到一个 CPU 的所有 NAPI 结构之间可以花费多少数据包处理资源。Example: set the overall packet processing budget to 600.
示例:将整体数据包处理预算设置为 600
$ sudo sysctl -w net.core.netdev_budget=600
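You can read the value back to confirm the change took effect:

$ sysctl net.core.netdev_budget
net.core.netdev_budget = 600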
You may also want to write this setting to your
/etc/sysctl.conf
file so that changes persist between reboots.你可能还想将这个设置写入
/etc/sysctl.conf
文件,以便更改在重启后仍然有效。The default value on Linux 3.13.0 is 300.
在 Linux 3.13.0 中,默认值是 300。
Generic Receive Offloading (GRO)(通用接收卸载(GRO))
Generic Receive Offloading (GRO) is a software implementation of a hardware optimization that is known as Large Receive Offloading (LRO).
通用接收卸载(Generic Receive Offloading,GRO)是一种硬件优化(称为大接收卸载,Large Receive Offloading,LRO)的软件实现。
The main idea behind both methods is that reducing the number of packets passed up the network stack by combining “similar enough” packets together can reduce CPU usage. For example, imagine a case where a large file transfer is occurring and most of the packets contain chunks of data in the file. Instead of sending small packets up the stack one at a time, the incoming packets can be combined into one packet with a huge payload. That packet can then be passed up the stack. This allows the protocol layers to process a single packet’s headers while delivering bigger chunks of data to the user program.
这两种方法的主要思想是,通过将 “足够相似” 的数据包合并在一起,减少传递到网络栈的数据包数量,从而降低 CPU 使用率。例如,想象一个大文件传输的场景,大多数数据包包含文件中的数据块。与其一次将小数据包逐个发送到栈中,不如将传入的数据包合并成一个带有巨大有效负载的数据包,然后将这个数据包传递到栈中。这使得协议层可以处理单个数据包的头部,同时将更大的数据块传递给用户程序。
The problem with this sort of optimization is, of course, information loss. If a packet had some important option or flag set, that option or flag could be lost if the packet is coalesced into another. And this is exactly why most people don’t use or encourage the use of LRO. LRO implementations, generally speaking, had very lax rules for coalescing packets.
当然,这种优化的问题在于信息丢失。如果一个数据包设置了一些重要的选项或标志,当它被合并到另一个数据包中时,这些选项或标志可能会丢失。这正是为什么大多数人不使用或不鼓励使用 LRO 的原因。一般来说,LRO 实现对于合并数据包的规则非常宽松。
GRO was introduced as an implementation of LRO in software, but with more strict rules around which packets can be coalesced.
GRO 作为 LRO 的软件实现被引入,但对于哪些数据包可以合并有更严格的规则。
By the way: if you have ever used
tcpdump
and seen unrealistically large incoming packet sizes, it is most likely because your system has GRO enabled. As you’ll see soon, packet capture taps are inserted further up the stack, after GRO has already happened.顺便说一下:如果你曾经使用过
tcpdump
,并且看到不切实际的大传入数据包大小,很可能是因为你的系统启用了 GRO。正如你很快就会看到的,数据包捕获点是在网络栈中更靠上的位置插入的,在 GRO 已经发生之后。Tuning: Adjusting GRO settings with ethtool
(调整:使用 ethtool 调整 GRO 设置)
You can use
ethtool
to check if GRO is enabled and also to adjust the setting.你可以使用
ethtool
检查 GRO 是否启用,也可以调整这个设置。Use
ethtool -k
to check your GRO settings.使用
ethtool -k
检查你的 GRO 设置:$ ethtool -k eth0 | grep generic-receive-offload generic-receive-offload: on
As you can see, on this system I have
generic-receive-offload
set to on.如你所见,在这个系统上,我将
generic-receive-offload
设置为开启。Use
ethtool -K
to enable (or disable) GRO.使用
ethtool -K
启用(或禁用)GRO:$ sudo ethtool -K eth0 gro on
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
注意:对于大多数驱动程序,进行这些更改会使网络接口先关闭再重新启动,与该接口的连接将被中断。不过,对于一次性更改而言,这可能影响不大。
napi_gro_receive
The function
napi_gro_receive
deals with processing network data for GRO (if GRO is enabled for the system) and sending the data up the stack toward the protocol layers. Much of this logic is handled in a function called
.napi_gro_receive
函数负责处理网络数据的 GRO(如果系统启用了 GRO),并将数据发送到栈中,朝着协议层传递。大部分逻辑在一个名为dev_gro_receive
的函数中处理。dev_gro_receive
This function begins by checking if GRO is enabled and, if so, preparing to do GRO. In the case where GRO is enabled, a list of GRO offload filters is traversed to allow the higher level protocol stacks to act on a piece of data which is being considered for GRO. This is done so that the protocol layers can let the network device layer know if this packet is part of a network flow that is currently being receive offloaded and handle anything protocol specific that should happen for GRO. For example, the TCP protocol will need to decide if/when to ACK a packet that is being coalesced into an existing packet.
这个函数首先检查 GRO 是否启用,如果启用,则准备进行 GRO 操作。在 GRO 启用的情况下,会遍历一组 GRO 卸载过滤器,以便更高层的协议栈可以对正在考虑进行 GRO 的数据进行操作。这样做是为了让协议层能够告知网络设备层这个数据包是否属于当前正在进行接收卸载的网络流,并处理 GRO 所需的任何特定于协议的操作。例如,TCP 协议需要决定是否以及何时对正在合并到现有数据包中的数据包进行 ACK 响应。
Here’s the code from
net/core/dev.c
which does this:在
net/core/dev.c
中的代码如下:

list_for_each_entry_rcu(ptype, head, list) {
        if (ptype->type != type || !ptype->callbacks.gro_receive)
                continue;

        skb_set_network_header(skb, skb_gro_offset(skb));
        skb_reset_mac_len(skb);
        NAPI_GRO_CB(skb)->same_flow = 0;
        NAPI_GRO_CB(skb)->flush = 0;
        NAPI_GRO_CB(skb)->free = 0;

        pp = ptype->callbacks.gro_receive(&napi->gro_list, skb);
        break;
}
If the protocol layers indicated that it is time to flush the GRO’d packet, that is taken care of next. This happens with a call to
napi_gro_complete
, which calls a gro_complete
callback for the protocol layers and then passes the packet up the stack by calling netif_receive_skb
.如果协议层表示是时候刷新 GRO 数据包了,接下来就会进行刷新。这需要调用 napi_gro_complete,它调用协议层的 gro_complete 回调,然后通过调用 netif_receive_skb,将数据包传递到堆栈。
Here’s the code from
net/core/dev.c
which does this:

if (pp) {
        struct sk_buff *nskb = *pp;

        *pp = nskb->next;
        nskb->next = NULL;
        napi_gro_complete(nskb);
        napi->gro_count--;
}
Next, if the protocol layers merged this packet to an existing flow,
napi_gro_receive
simply returns as there’s nothing else to do.接下来,如果协议层将该数据包合并到了一个已有的数据流中,napi_gro_receive 就会返回,因为没有其他事情可做了。
If the packet was not merged and there are fewer than
MAX_GRO_SKBS
(8) GRO flows on the system, a new entry is added to the gro_list
on the NAPI structure for this CPU.如果数据包未被合并,且系统中的 GRO 流量少于 MAX_GRO_SKBS(8),则在该 CPU 的 NAPI 结构 gro_list 中添加一个新条目。
Here’s the code from
net/core/dev.c
which does this:

if (NAPI_GRO_CB(skb)->flush || napi->gro_count >= MAX_GRO_SKBS)
        goto normal;

napi->gro_count++;
NAPI_GRO_CB(skb)->count = 1;
NAPI_GRO_CB(skb)->age = jiffies;
skb_shinfo(skb)->gso_size = skb_gro_len(skb);
skb->next = napi->gro_list;
napi->gro_list = skb;
ret = GRO_HELD;
And that is how the GRO system in the Linux networking stack works.
这就是 Linux 网络协议栈中 GRO 系统的工作原理。
napi_skb_finish
Once
dev_gro_receive
completes, napi_skb_finish
is called which either frees unneeded data structures because a packet has been merged, or calls netif_receive_skb
to pass the data up the network stack (because there were already MAX_GRO_SKBS
flows being GRO’d).当 dev_gro_receive 执行完毕后,会调用 napi_skb_finish,该函数要么释放不需要的数据结构(因为数据包已经被合并),要么调用 netif_receive_skb 将数据传递给网络协议栈(因为已经有 MAX_GRO_SKBS 条流正在进行 GRO 处理)。
Next, it’s time for
netif_receive_skb
to see how data is handed off to the protocol layers. Before this can be examined, we’ll need to take a look at Receive Packet Steering (RPS) first.接下来,netif_receive_skb 将介绍如何将数据移交给协议层。在检查之前,我们需要先了解一下接收数据包转向(RPS)。
Receive Packet Steering (RPS)(接收数据包导向(RPS))
Recall earlier how we discussed that network device drivers register a NAPI
poll
function. Each NAPI
poller instance is executed in the context of a softirq of which there is one per CPU. Further recall that the CPU which the driver’s IRQ handler runs on will wake its softirq processing loop to process packets.回想一下我们之前讨论过网络设备驱动程序注册一个 NAPI 轮询函数。每个 NAPI 轮询器实例都在软中断(softirq)的上下文中执行,每个 CPU 都有一个软中断。此外,驱动程序的 IRQ 处理程序运行的 CPU 会唤醒其软中断处理循环来处理数据包。
In other words: a single CPU processes the hardware interrupt and polls for packets to process incoming data.
换句话说:单个CPU处理硬件中断,并轮询数据包以处理传入的数据。
Some NICs (like the Intel I350) support multiple queues at the hardware level. This means incoming packets can be DMA’d to a separate memory region for each queue, with a separate NAPI structure to manage polling this region, as well. Thus multiple CPUs will handle interrupts from the device and also process packets.
一些网卡(如英特尔I350)在硬件层面支持多个队列。这意味着传入的数据包可以被DMA到每个队列的独立内存区域,并且有单独的NAPI结构来管理对这个区域的轮询。因此,多个CPU将处理来自设备的中断并同时处理数据包。
This feature is typically called Receive Side Scaling (RSS).
此功能通常称为接收端扩展(RSS)。
Receive Packet Steering (RPS) is a software implementation of RSS. Since it is implemented in software, this means it can be enabled for any NIC, even NICs which have only a single RX queue. However, since it is in software, this means that RPS can only enter into the flow after a packet has been harvested from the DMA memory region.
接收数据包转向(RPS)是RSS的一种软件实现。由于它是通过软件实现的,这意味着它可以为任何网卡启用,即使是一些只有单个RX队列的网卡。然而,由于它是软件实现的,这意味着RPS只能在数据包从DMA内存区域中获取后进入处理流程。
This means that you wouldn’t notice a decrease in CPU time spent handling IRQs or the NAPI
poll
loop, but you can distribute the load for processing the packet after it’s been harvested and reduce CPU time from there up the network stack.这意味着您不会注意到处理IRQ或NAPI轮询循环所花费的CPU时间减少,但您可以分配数据包处理后的负载,从而减少网络堆栈上层的CPU时间消耗。
RPS works by generating a hash for incoming data to determine which CPU should process the data. The data is then enqueued to the per-CPU receive network backlog to be processed. An Inter-processor Interrupt (IPI) is delivered to the CPU owning the backlog. This helps to kick-start backlog processing if it is not currently processing data on the backlog. The
/proc/net/softnet_stat
contains a count of the number of times each softnet_data
struct has received an IPI (the received_rps
field).RPS 通过为传入数据生成哈希值来确定应由哪个 CPU 处理数据。然后,数据被排队到每个 CPU 的接收网络积压队列中进行处理。一个跨处理器中断(IPI)会被发送到拥有该积压队列的 CPU。如果当前 CPU 不在处理积压队列中的数据,则这有助于启动积压队列的处理过程。
/proc/net/softnet_stat
包含每个 softnet_data
结构接收到 IPI
的次数计数( received_rps
字段)。Thus,
netif_receive_skb
will either continue sending network data up the networking stack, or hand it over to RPS for processing on a different CPU.因此,
netif_receive_skb
要么继续将网络数据向上发送到网络堆栈,要么将其交给 RPS
在另一个 CPU 上进行处理。Tuning: Enabling RPS(调优:启用 RPS)
For RPS to work, it must be enabled in the kernel configuration (it is on Ubuntu for kernel 3.13.0), and a bitmask describing which CPUs should process packets for a given interface and RX queue.
要使 RPS 起作用,必须在内核配置中启用它(在 Ubuntu 的 3.13.0 内核上已启用),并设置一个位掩码,以描述哪些 CPU 应为给定接口和 RX 队列处理数据包。
You can find some documentation about these bitmasks in the kernel documentation.
您可以在内核文档中找到有关这些位掩码的一些文档。
In short, the bitmasks to modify are found in:
/sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus
总之,要修改的位掩码位于:
/sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus
So, for
eth0
and receive queue 0, you would modify the file: /sys/class/net/eth0/queues/rx-0/rps_cpus
with a hexadecimal number indicating which CPUs should process packets from eth0
’s receive queue 0. As the documentation points out, RPS may be unnecessary in certain configurations.因此,对于 eth0 和接收队列 0,您需要修改文件:
/sys/class/net/eth0/queues/rx-0/rps_cpus
,使用十六进制数来指示哪些 CPU 应处理来自 eth0 接收队列 0 的数据包。正如文档所指出的,在某些配置中,RPS可能是不必要的。Note: enabling RPS to distribute packet processing to CPUs which were previously not processing packets will cause the number of `NET_RX` softirqs to increase for that CPU, as well as the `si` or `sitime` in the CPU usage graph. You can compare before and after of your softirq and CPU usage graphs to confirm that RPS is configured properly to your liking.
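For example, to allow CPUs 0-3 to process packets from eth0's receive queue 0, you would write the mask f (illustrative only; choose a mask that matches your own CPU layout):

$ sudo bash -c 'echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus'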
注意:启用RPS将数据包处理分配到之前未处理数据包的CPU上,会导致该CPU上的
NET_RX
软中断数量增加,以及CPU使用图中的si
或sitime
增加。您可以比较启用前后您的软中断和CPU使用图,以确认RPS已正确配置为您的需求。Receive Flow Steering (RFS)(接收流转向(RFS))
Receive flow steering (RFS) is used in conjunction with RPS. RPS attempts to distribute incoming packet load amongst multiple CPUs, but does not take into account any data locality issues for maximizing CPU cache hit rates. You can use RFS to help increase cache hit rates by directing packets for the same flow to the same CPU for processing.
接收流量转向(RFS)与RPS结合使用。RPS尝试在多个CPU之间分配传入的数据包负载,但并未考虑任何数据局部性问题以最大化CPU缓存命中率。您可以使用RFS来帮助提高缓存命中率,通过将相同流的数据包定向到同一CPU进行处理。
Tuning: Enabling RFS(调优:启用RFS )
For RFS to work, you must have RPS enabled and configured.
要使RFS正常工作,您必须启用并配置RPS。
RFS keeps track of a global hash table of all flows and the size of this hash table can be adjusted by setting the
net.core.rps_sock_flow_entries
sysctl.RFS 会跟踪所有流的全局哈希表,该哈希表的大小可以通过设置
net.core.rps_sock_flow_entries
sysctl 进行调整。Increase the size of the RFS socket flow hash by setting a
sysctl
.通过设置sysctl 来增加 RFS 套接字流哈希表的大小。
$ sudo sysctl -w net.core.rps_sock_flow_entries=32768
Next, you can also set the number of flows per RX queue by writing this value to the sysfs file named
rps_flow_cnt
for each RX queue.接下来,您还可以通过将此值写入名为
rps_flow_cnt
的 sysfs
文件来设置每个 RX 队列的流数。Example: increase the number of flows for RX queue 0 on eth0 to 2048.
例如:将 eth0 上 RX 队列 0 的流数增加到 2048。
$ sudo bash -c 'echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt'
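For a multiqueue NIC, the kernel scaling documentation suggests dividing rps_sock_flow_entries across the RX queues; with the 32768 entries set above and, say, 8 queues, that works out to 4096 per queue. A sketch that applies this to every RX queue on eth0 (the interface name and queue count are assumptions):

$ for f in /sys/class/net/eth0/queues/rx-*/rps_flow_cnt; do
      sudo bash -c "echo 4096 > $f"
  done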
Hardware accelerated Receive Flow Steering (aRFS)硬件加速接收流转向(aRFS)
RFS can be sped up with the use of hardware acceleration; the NIC and the kernel can work together to determine which flows should be processed on which CPUs. To use this feature, it must be supported by the NIC and your driver.
RFS 可以通过使用硬件加速来提高速度;网卡和内核可以协同工作,确定哪些流应该在哪些 CPU 上处理。要使用此功能,必须得到网卡及其驱动程序的支持。请查阅您的网卡数据手册,以确定是否支持此功能。
Consult your NIC’s data sheet to determine if this feature is supported. If your NIC’s driver exposes a function called
ndo_rx_flow_steer
, then the driver has support for accelerated RFS.如果您的网卡驱动程序公开了一个名为
ndo_rx_flow_steer
的函数,则该驱动程序支持加速的 RFS。Tuning: Enabling accelerated RFS (aRFS)调优:启用加速 RFS (aRFS)
Assuming that your NIC and driver support it, you can enable accelerated RFS by enabling and configuring a set of things:
假设您的网卡和驱动程序支持此功能,您可以通过启用并配置以下内容来启用加速 RFS:
- Have RFS enabled and configured.
- Your kernel has
CONFIG_RFS_ACCEL
enabled at compile time. The Ubuntu kernel 3.13.0 does.
- Have ntuple support enabled for the device, as described previously. You can use
ethtool
to verify that ntuple support is enabled for the device.
- Configure your IRQ settings to ensure each RX queue is handled by one of your desired network processing CPUs.
- 启用并配置 RFS。
- 内核在编译时已启用
CONFIG_RFS_ACCEL
。Ubuntu 内核 3.13.0 已启用此选项。
- 为设备启用 ntuple 支持,如前所述。您可以使用 ethtool 验证设备是否已启用 ntuple 支持。
- 配置 IRQ 设置,以确保每个 RX 队列由您所需的网络处理 CPU 中的一个处理。
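A quick way to sanity-check the prerequisites in the list above (a sketch: the kernel config path assumes an Ubuntu-style /boot/config file, and eth0 is a placeholder interface name):

$ grep CONFIG_RFS_ACCEL /boot/config-$(uname -r)   # expect CONFIG_RFS_ACCEL=y
$ ethtool -k eth0 | grep ntuple                    # ntuple-filters: on
$ sudo ethtool -K eth0 ntuple on                   # turn it on if it is off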
Once the above is configured, accelerated RFS will be used to automatically move data to the RX queue tied to a CPU core that is processing data for that flow and you won’t need to specify an ntuple filter rule manually for each flow.
一旦完成上述配置,加速的RFS将被用于自动将数据移动到与处理该流数据的CPU核心绑定的RX队列,您无需为每个流手动指定ntuple过滤规则。
Moving up the network stack with netif_receive_skb
通过 netif_receive_skb 沿网络协议栈向上
Picking up where we left off with
netif_receive_skb
, which is called from a few places. The two most common (and also the two we’ve already looked at):接续我们之前讨论的
netif_receive_skb
,它被从多个地方调用。其中最常见的是以下两种情况(也是我们已经研究过的两种情况):napi_skb_finish
if the packet is not going to be merged to an existing GRO’d flow, OR
napi_gro_complete
if the protocol layers indicated that it’s time to flush the flow.
如果数据包不会被合并到现有的 GRO 流中,则调用
napi_skb_finish
; 或者,如果协议层表明是时候刷新流了,则调用
napi_gro_complete
; Reminder:
netif_receive_skb
and its descendants are operating in the context of a the softirq processing loop and you'll see the time spent here accounted for as sitime
or si
with tools like top
.提醒一下:
netif_receive_skb
及其子函数运行在软中断处理循环的上下文中,因此使用 top 等工具时,此处花费的时间将被计入 sitime 或 si。netif_receive_skb
begins by first checking a sysctl
value to determine if the user has requested receive timestamping before or after a packet hits the backlog queue. If this setting is enabled, the data is timestamped now, prior to it hitting RPS (and the CPU’s associated backlog queue). If this setting is disabled, it will be timestamped after it hits the queue. This can be used to distribute the load of timestamping amongst multiple CPUs if RPS is enabled, but will introduce some delay as a result.netif_receive_skb
首先会检查一个 sysctl 值,以确定用户是否请求在数据包进入回环队列之前或之后进行接收时间戳记录。如果此设置启用,则数据将在进入 RPS(以及 CPU 的相关回环队列)之前立即进行时间戳记录。如果此设置禁用,则会在数据进入队列后进行时间戳记录。如果启用了 RPS,这可用于在多个 CPU 之间分配时间戳记录的负载,但也会因此引入一些延迟。Tuning: RX packet timestamping(调优:RX 数据包时间戳)
You can tune when packets will be timestamped after they are received by adjusting a sysctl named
net.core.netdev_tstamp_prequeue
:您可以通过调整名为
net.core.netdev_tstamp_prequeue
的 sysctl 来设置数据包在被接收后打时间戳的时间:Disable timestamping for RX packets by adjusting a
sysctl
通过调整 sysctl 来禁用 RX 数据包的时间戳
$ sudo sysctl -w net.core.netdev_tstamp_prequeue=0
The default value is 1. Please see the previous section for an explanation as to what this setting means, exactly.
默认值为 1。有关此设置的确切含义,请参阅上一节的说明。
netif_receive_skb
After the timestamping is dealt with,
netif_receive_skb
operates differently depending on whether or not RPS is enabled. Let’s start with the simpler path first: RPS disabled.时间戳处理完成后,
netif_receive_skb
的操作方式会有所不同,具体取决于 RPS 是否启用。我们先从较简单的路径开始:RPS 已禁用。Without RPS (default setting)
If RPS is not enabled,
__netif_receive_skb
is called which does some bookkeeping and then calls __netif_receive_skb_core
to move data closer to the protocol stacks.如果未启用 RPS,
__netif_receive_skb
会被调用,它会进行一些记录工作,然后调用 __netif_receive_skb_core
将数据移近协议栈。We’ll see precisely how
__netif_receive_skb_core
works, but first let’s see how the RPS enabled code path works, as that code will also call __netif_receive_skb_core
.我们稍后会精确地看到
__netif_receive_skb_core
的工作原理,但首先让我们看看启用 RPS 的代码路径是如何工作的,因为该代码也会调用 __netif_receive_skb_core
。With RPS enabled
If RPS is enabled, after the timestamping options mentioned above are dealt with,
netif_receive_skb
will perform some computations to determine which CPU’s backlog queue should be used. This is done by using the function get_rps_cpu
. From net/core/dev.c:如果启用了 RPS,在处理完上述时间戳选项后,
netif_receive_skb
将执行一些计算以确定应使用哪个 CPU 的积压队列。这是通过使用函数 get_rps_cpu
来完成的。来自 net/core/dev.c
:

cpu = get_rps_cpu(skb->dev, skb, &rflow);
if (cpu >= 0) {
        ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
        rcu_read_unlock();
        return ret;
}
get_rps_cpu
will take into account RFS and aRFS settings as described above to ensure the the data gets queued to the desired CPU’s backlog with a call to enqueue_to_backlog
.get_rps_cpu
将根据上述描述,考虑 RFS 和 aRFS 设置,通过调用 enqueue_to_backlog
,确保数据被排队到目标 CPU 的 backlog 中。enqueue_to_backlog
This function begins by getting a pointer to the remote CPU’s
softnet_data
structure, which contains a pointer to the input_pkt_queue
. Next, the queue length of the input_pkt_queue
of the remote CPU is checked. From net/core/dev.c:此函数首先获取远程CPU的
softnet_data
结构体指针,该结构体包含指向 input_pkt_queue
的指针。接下来,检查远程CPU的 input_pkt_queue
队列长度。来自 net/core/dev.c
:qlen = skb_queue_len(&sd->input_pkt_queue); if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
The length of
input_pkt_queue
is first compared to netdev_max_backlog
. If the queue is longer than this value, the data is dropped. Similarly, the flow limit is checked and if it has been exceeded, the data is dropped. In both cases the drop count on the softnet_data
structure is incremented. Note that this is the softnet_data
structure of the CPU the data was going to be queued to. Read the section above about /proc/net/softnet_stat
to learn how to get the drop count for monitoring purposes.输入数据包队列的长度首先与
netdev_max_backlog
进行比较。如果队列长度超过该值,则丢弃数据。同样,也会检查流限制,如果已超过,则丢弃数据。在这两种情况下,都会增加 softnet_data
结构上的丢弃计数。请注意,这是数据将要排队到的 CPU 的 softnet_data
结构。有关如何获取丢弃计数以进行监控,请阅读上面关于 /proc/net/softnet_stat
的部分。enqueue_to_backlog
is not called in many places. It is called for RPS-enabled packet processing and also from netif_rx
. Most drivers should not be using netif_rx
and should instead be using netif_receive_skb
. If you are not using RPS and your driver is not using netif_rx
, increasing the backlog won’t produce any noticeable effect on your system as it is not used.enqueue_to_backlog
在许多地方并未被调用。它仅用于支持 RPS 的数据包处理,以及从 netif_rx
调用。大多数驱动程序不应使用 netif_rx
,而应改用 netif_receive_skb
。如果你未使用 RPS 且你的驱动程序未使用 netif_rx
,增加 backlog
对你的系统不会产生任何明显影响,因为它不会被使用。Note: You need to check the driver you are using. If it calls
netif_receive_skb
and you are not using RPS, increasing the netdev_max_backlog
will not yield any performance improvement because no data will ever make it to the input_pkt_queue
.注意:您需要检查所使用的驱动程序。如果它调用了
netif_receive_skb
,而您未使用 RPS,则增加netdev_max_backlog
不会带来任何性能提升,因为数据永远不会进入 input_pkt_queue
。Assuming that the
input_pkt_queue
is small enough and the flow limit (more about this, next) hasn’t been reached (or is disabled), the data can be queued. The logic here is a bit funny, but can be summarized as:假设输入数据包队列足够小,并且流量限制(稍后会详细介绍)尚未达到(或已禁用),则可以将数据排队。这里的逻辑有点奇怪,但可以概括为:
- If the queue is empty: check if NAPI has been started on the remote CPU. If not, check if an IPI is queued to be sent. If not, queue one and start the NAPI processing loop by calling
____napi_schedule
. Proceed to queuing the data. - 如果队列为空,则检查远程CPU上的NAPI是否已启动。如果没有启动,则检查是否已排队待发送的 IPI 中断请求。如果没有,则排队一个并调用
____napi_schedule
启动NAPI处理循环。然后继续排队数据。
- If the queue is not empty, or the previously described operation has completed, enqueue the data.
- 如果队列不为空,或者前面描述的操作已完成,则将数据入队。
The code is a bit tricky with its use of
goto
, so read it carefully. From net/core/dev.c:代码使用了 goto 语句,有点复杂,所以请仔细阅读。来自 net/core/dev.c:
if (skb_queue_len(&sd->input_pkt_queue)) { enqueue: __skb_queue_tail(&sd->input_pkt_queue, skb); input_queue_tail_incr_save(sd, qtail); rps_unlock(sd); local_irq_restore(flags); return NET_RX_SUCCESS; } /* Schedule NAPI for backlog device * We can use non atomic operation since we own the queue lock */ if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) { if (!rps_ipi_queued(sd)) ____napi_schedule(sd, &sd->backlog); } goto enqueue;
Flow limits
RPS distributes packet processing load amongst multiple CPUs, but a single large flow can monopolize CPU processing time and starve smaller flows. Flow limits are a feature that can be used to limit the number of packets queued to the backlog for each flow to a certain amount. This can help ensure that smaller flows are processed even though much larger flows are pushing packets in.
RPS 将数据包处理负载分配到多个 CPU 上,但单个大流量可能会独占 CPU 处理时间并导致小流量得不到服务。流限制是一种功能,可用于将排队到每个流的积压队列中的数据包数量限制在一定数量内。这有助于确保即使有大量流量涌入,较小的流量也能得到处理。
The if statement above from net/core/dev.c checks the flow limit with a call to
skb_flow_limit
:来自 net/core/dev.c 的上述 if 语句通过调用
skb_flow_limit
检查流限制:if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
This code is checking that there is still room in the queue and that the flow limit has not been reached. By default, flow limits are disabled. In order to enable flow limits, you must specify a bitmap (similar to RPS’ bitmap).
这段代码正在检查队列中是否还有剩余空间,以及流量限制是否已被达到。默认情况下,流量限制是禁用的。为了启用流量限制,您必须指定一个位图(类似于RPS的位图)。
Monitoring: Monitor drops due to full input_pkt_queue
or flow limit
监控:由于输入数据包队列或流限制满而导致的丢包
See the section above about monitoring
/proc/net/softnet_stat
. The dropped
field is a counter that gets incremented each time data is dropped instead of queued to a CPU’s input_pkt_queue
.请参阅上面有关监控
/proc/net/softnet_stat
的部分。dropped
字段是一个计数器,每当数据被丢弃而不是排队到 CPU 的 input_pkt_queue
时,该计数器就会递增。Tuning
Tuning: Adjusting
netdev_max_backlog
to prevent drops调优:调整 netdev_max_backlog 以防止丢包。
Before adjusting this tuning value, see the note in the previous section.
You can help prevent drops in
enqueue_to_backlog
by increasing the netdev_max_backlog
if you are using RPS or if your driver calls netif_rx
.在调整此调优值之前,请参阅上一节中的注释。如果您使用 RPS 或您的驱动程序调用
netif_rx
,则可以通过增加 netdev_max_backlog
来帮助防止 enqueue_to_backlog
中的丢包。Example: increase backlog to 3000 with
sysctl
.例如:使用 sysctl 将 backlog 增加到 3000。
$ sudo sysctl -w net.core.netdev_max_backlog=3000
The default value is 1000.
Tuning: Adjust the NAPI weight of the backlog
poll
loop调优:调整后台轮询循环的 NAPI 权重
You can adjust the weight of the backlog’s NAPI poller by setting the
net.core.dev_weight
sysctl. Adjusting this value determines how much of the overall budget the backlog poll
loop can consume (see the section above about adjusting net.core.netdev_budget
):您可以设置
net.core.dev_weight
sysctl 来调整积压队列 NAPI 轮询器的权重。调整该值可决定积压队列轮询循环可以消耗的整体预算量(请参阅上面有关调整 net.core.netdev_budget
的部分):Example: increase the NAPI
poll
backlog processing loop with sysctl
.示例:使用 sysctl 增加 NAPI 轮询积压处理循环。
$ sudo sysctl -w net.core.dev_weight=600
The default value is 64.
Remember, backlog processing runs in the softirq context similar to the device driver’s registered
poll
function and will be limited by the overall budget
and a time limit, as described in previous sections.请记住,积压任务的处理运行在与设备驱动程序注册的轮询函数类似的软中断上下文中,并将受到整体预算和时间限制的约束,如前几节所述。
Tuning: Enabling flow limits and tuning flow limit hash table size
调优:启用流量限制并调整流量限制哈希表大小
Set the size of the flow limit table with a
sysctl
.使用 sysctl 设置流限制表的大小。
$ sudo sysctl -w net.core.flow_limit_table_len=8192
The default value is 4096.
This change only affects newly allocated flow hash tables. So, if you’d like to increase the table size, you should do it before you enable flow limits.
此更改仅影响新分配的流哈希表。因此,如果您想增加表大小,应在启用流限制之前进行。
To enable flow limits you should specify a bitmask in
/proc/sys/net/core/flow_limit_cpu_bitmap
similar to the RPS bitmask which indicates which CPUs have flow limits enabled.要启用流限制,您应该在
/proc/sys/net/core/flow_limit_cpu_bitmap
中指定一个位掩码,类似于 RPS 位掩码,它指示哪些 CPU 已启用流限制。
backlog queue NAPI poller(积压队列 NAPI 轮询器)
The per-CPU backlog queue plugs into NAPI the same way a device driver does. A
poll
function is provided that is used to process packets from the softirq context. A weight
is also provided, just as a device driver would.每个CPU的积压队列以与设备驱动程序相同的方式接入NAPI。提供了一个轮询函数,用于从软中断上下文处理数据包。还提供了一个权重,就像设备驱动程序一样。
This NAPI struct is provided during initialization of the networking system. From
net_dev_init
in net/core/dev.c
:此 NAPI 结构体在联网系统的初始化期间提供。来自 net/core/dev.c 中的 net_dev_init:
sd->backlog.poll = process_backlog; sd->backlog.weight = weight_p; sd->backlog.gro_list = NULL; sd->backlog.gro_count = 0;
The backlog NAPI structure differs from the device driver NAPI structure in that the
weight
parameter is adjustable, whereas drivers hardcode their NAPI weight to 64. We’ll see in the tuning section below how to adjust the weight using a sysctl
.NAPI 结构的 backlog 与设备驱动程序 NAPI 结构不同,其权重参数是可调的,而驱动程序则将 NAPI 权重硬编码为 64。我们将在下面的调优部分中看到如何使用 sysctl 调整权重。
process_backlog
The
process_backlog
function is a loop which runs until its weight (as described in the previous section) has been consumed or no more data remains on the backlog.process_backlog
函数是一个循环,直到其权重(如前一节所述)被消耗完或队列中不再有数据为止。Each piece of data on the backlog queue is removed from the backlog queue and passed on to
__netif_receive_skb
. The code path once the data hits __netif_receive_skb
is the same as explained above for the RPS disabled case. Namely, __netif_receive_skb
does some bookkeeping prior to calling __netif_receive_skb_core
to pass network data up to the protocol layers.队列中的每条数据都会从队列中移除,并传递给 __netif_receive_skb 。一旦数据到达 __netif_receive_skb ,代码路径与上述 RPS 禁用情况下的处理方式相同。即,__netif_receive_skb 在调用 __netif_receive_skb_core 之前会进行一些记录工作,然后将网络数据传递到协议层。
process_backlog
follows the same contract with NAPI that device drivers do, which is: NAPI is disabled if the total weight will not be used. The poller is restarted with the call to ____napi_schedule
from enqueue_to_backlog
as described above.process_backlog 与 NAPI 的约定与设备驱动程序的约定相同,即:如果总权重不会被使用,则禁用 NAPI。通过 enqueue_to_backlog 中对 ____napi_schedule 的调用重新启动轮询器,如上所述。
The function returns the amount of work done, which
net_rx_action
(described above) will subtract from the budget (which is adjusted with the net.core.netdev_budget
, as described above).该函数返回已完成的工作量,net_rx_action (如上所述)会从预算中减去该值(预算由 net.core.netdev_budget 调整,如上所述)。
__netif_receive_skb_core
delivers data to packet taps and protocol layers
__netif_receive_skb_core 将数据传递给数据包监听器和协议层
__netif_receive_skb_core
performs the heavy lifting of delivering the data to protocol stacks. Before it does this, it checks if any packet taps have been installed which are catching all incoming packets. One example of something that does this is the AF_PACKET
address family, typically used via the libpcap library.__netif_receive_skb_core
执行将数据传递给协议栈的繁重工作。在执行此操作之前,它会检查是否安装了任何捕获所有传入数据包的数据包抓取器(packet taps)。一个典型例子是 AF_PACKET
地址族,通常通过 libpcap 库使用。If such a tap exists, the data is delivered there first then to the protocol layers next.
如果存在这样的抓取器,则首先将数据传递到那里,然后再传递到协议层。
Packet tap delivery(数据包捕获传输)
If a packet tap is installed (usually via libpcap), the packet is delivered there with the following code from net/core/dev.c:
如果安装了数据包捕获(通常通过 libpcap),则数据包将通过来自 net/core/dev.c 的以下代码传递到那里:
list_for_each_entry_rcu(ptype, &ptype_all, list) { if (!ptype->dev || ptype->dev == skb->dev) { if (pt_prev) ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = ptype; } }
If you are curious about the path the data takes through pcap, read net/packet/af_packet.c.
如果你对数据通过 pcap 的路径感到好奇,请阅读 net/packet/af_packet.c。
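For a userland view of what a packet tap is, the sketch below opens an AF_PACKET socket with ETH_P_ALL, which registers a handler on the ptype_all list described above; this is roughly what libpcap sets up under the hood. It needs CAP_NET_RAW (for example, run as root), and the 10-frame loop is just an arbitrary choice for illustration.
下面是一个从用户态观察数据包抓取器的示例草图:打开一个 ETH_P_ALL 的 AF_PACKET 套接字(这大致就是 libpcap 在底层所做的事情),需要 CAP_NET_RAW 权限。
```c
/* Minimal AF_PACKET tap: receives a copy of every frame the host sees.
 * Needs CAP_NET_RAW (run as root). Roughly what libpcap does internally. */
#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <arpa/inet.h>        /* htons */

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket(AF_PACKET)");
        return 1;
    }

    unsigned char frame[65536];
    for (int i = 0; i < 10; i++) {
        ssize_t len = recv(fd, frame, sizeof(frame), 0);
        if (len < 0) {
            perror("recv");
            break;
        }
        printf("got a %zd byte frame\n", len);
    }
    return 0;
}
```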
Protocol layer delivery
Once the taps have been satisfied,
__netif_receive_skb_core
delivers data to protocol layers. It does this by obtaining the protocol field from the data and iterating across a list of deliver functions registered for that protocol type.一旦处理完所有数据包抓取器(taps),
__netif_receive_skb_core
将数据传递给协议层。它通过从数据中获取协议字段,并遍历为该协议类型注册的传递函数列表来实现这一点。This can be seen in
__netif_receive_skb_core
in net/core/dev.c:这可以在 net/core/dev.c 中的
__netif_receive_skb_core
中看到:type = skb->protocol; list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) { if (ptype->type == type && (ptype->dev == null_or_dev || ptype->dev == skb->dev || ptype->dev == orig_dev)) { if (pt_prev) ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = ptype; } }
The
ptype_base
identifier above is defined as a hash table of lists in net/core/dev.c:上述 ptype_base 标识符在 net/core/dev.c 中被定义为一个链表哈希表:
struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
Each protocol layer adds a filter to a list at a given slot in the hash table, computed with a helper function called
ptype_head
:每个协议层都会在哈希表中给定槽位处添加一个过滤器到列表中,该槽位由名为 ptype_head 的辅助函数计算得出:
static inline struct list_head *ptype_head(const struct packet_type *pt) { if (pt->type == htons(ETH_P_ALL)) return &ptype_all; else return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK]; }
Adding a filter to the list is accomplished with a call to
dev_add_pack
. That is how protocol layers register themselves for network data delivery for their protocol type.通过调用
dev_add_pack
来向列表中添加过滤器。这就是协议层如何注册自己以接收其协议类型的数据的。And now you know how network data gets from the NIC to the protocol layer.
现在你知道网络数据是如何从网卡(NIC)到达协议层的了。
Protocol layer registration(协议层注册)
Now that we know how data is delivered to the protocol stacks from the network device subsystem, let’s see how a protocol layer registers itself.
现在我们已经了解了数据如何从网络设备子系统传递到协议栈,让我们看看协议层是如何注册自身的。
This blog post is going to examine the IP protocol stack as it is a commonly used protocol and will be relevant to most readers.
本篇博客将研究IP协议栈,因为它是一种常用的协议,对大多数读者来说都具有相关性。
IP protocol layer
The IP protocol layer plugs itself into the
ptype_base
hash table so that data will be delivered to it from the network device layer described in previous sections.This happens in the function
inet_init
from net/ipv4/af_inet.c:IP协议层通过将自身插入
ptype_base
哈希表中,从而实现数据从前面章节描述的网络设备层传递到IP协议层。这发生在 net/ipv4/af_inet.c 中的 inet_init
函数中:dev_add_pack(&ip_packet_type);
This registers the IP packet type structure defined at net/ipv4/af_inet.c:
这会注册在 net/ipv4/af_inet.c 中定义的 IP 数据包类型结构体:
static struct packet_type ip_packet_type __read_mostly = { .type = cpu_to_be16(ETH_P_IP), .func = ip_rcv, };
__netif_receive_skb_core
calls deliver_skb
(as seen in the above section), which calls func
(in this case, ip_rcv
).__netif_receive_skb_core
调用 deliver_skb
(如上节所示),后者调用 func(在本例中为 ip_rcv)。ip_rcv
The
ip_rcv
function is pretty straight-forward at a high level. There are several integrity checks to ensure the data is valid. Statistics counters are bumped, as well.ip_rcv 函数在高层次上相当直接。有几个完整性检查以确保数据有效。统计计数器也会相应增加。
ip_rcv
ends by passing the packet to ip_rcv_finish
by way of netfilter. This is done so that any iptables rules that should be matched at the IP protocol layer can take a look at the packet before it continues on.ip_rcv 最后通过 netfilter 将数据包传递给 ip_rcv_finish。这样做是为了让任何应在 IP 协议层匹配的 iptables 规则可以在数据包继续处理之前查看该数据包。
We can see the code which hands the data over to netfilter at the end of
ip_rcv
in net/ipv4/ip_input.c:我们可以在 net/ipv4/ip_input.c 中看到将数据交给 netfilter 的代码,位于 ip_rcv 的末尾:
return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);
netfilter and iptables
In the interest of brevity (and my RSI), I’ve decided to skip my deep dive into netfilter, iptables, and conntrack.
为了简洁起见(以及避免我的重复性劳损),我决定跳过对netfilter、iptables和conntrack的深入探讨。
The short version is that
NF_HOOK_THRESH
will check if any filters are installed and attempt to return execution back to the IP protocol layer to avoid going deeper into netfilter and anything that hooks in below that like iptables and conntrack.简而言之,
NF_HOOK_THRESH
会检查是否安装了任何过滤器,并尝试将执行权返回给IP协议层,以避免进一步深入到netfilter及其以下挂钩的iptables和conntrack中。Keep in mind: if you have numerous or very complex netfilter or iptables rules, those rules will be executed in the softirq context and can lead to latency in your network stack. This may be unavoidable, though, if you need to have a particular set of rules installed.
请注意:如果你有大量或非常复杂的netfilter或iptables规则,这些规则将在softirq上下文中执行,并可能导致网络堆栈的延迟。不过,如果你需要安装一组特定的规则,这种情况可能是不可避免的。
ip_rcv_finish
Once netfilter has had a chance to take a look at the data and decide what to do with it,
ip_rcv_finish
is called. This only happens if the data is not being dropped by netfilter, of course.一旦网络过滤器有机会查看数据并决定如何处理,就会调用ip_rcv_finish。当然,这只有在数据没有被netfilter丢弃的情况下才会发生。
ip_rcv_finish
begins with an optimization. In order to deliver the packet to proper place, a dst_entry
from the routing system needs to be in place. In order to obtain one, the code initially attempts to call the early_demux
function from the higher level protocol that this data is destined for.ip_rcv_finish首先进行了一项优化。为了将数据包交付到正确的位置,需要路由系统中的dst_entry。为了获取这个dst_entry,代码首先尝试调用更高层协议的early_demux函数,该协议是数据的目标协议。
The
early_demux
routine is an optimization which attempts to find the dst_entry
needed to deliver the packet by checking if a dst_entry
is cached on the socket structure.Here’s what that looks like from net/ipv4/ip_input.c:
early_demux例程是一种优化方法,它通过检查套接字结构上是否缓存了dst_entry来尝试找到交付数据包所需的dst_entry。以下是来自net/ipv4/ip_input.c的实现:
if (sysctl_ip_early_demux && !skb_dst(skb) && skb->sk == NULL) { const struct net_protocol *ipprot; int protocol = iph->protocol; ipprot = rcu_dereference(inet_protos[protocol]); if (ipprot && ipprot->early_demux) { ipprot->early_demux(skb); /* must reload iph, skb->head might have changed */ iph = ip_hdr(skb); } }
As you can see above, this code is guarded by a sysctl
sysctl_ip_early_demux
. By default early_demux
is enabled. The next section includes information about how to disable it and why you might want to.如上所示,这段代码由一个 sysctl sysctl_ip_early_demux 保护。默认情况下,early_demux 是启用的。下一节包含有关如何禁用它以及您可能希望这样做的原因的信息。
If the optimization is enabled and there is no cached entry (because this is the first packet arriving), the packet will be handed off to the routing system in the kernel where the
dst_entry
will be computed and assigned.如果优化已启用,并且没有缓存条目(因为这是到达的第一个数据包),则数据包将被传递到内核中的路由系统,在那里将计算并分配 dst_entry。
Once the routing layer completes, statistics counters are updated and the function ends by calling
dst_input(skb)
which in turn calls the input function pointer on the packet’s dst_entry
structure that was affixed by the routing system.一旦路由层完成,统计计数器将被更新,该函数通过调用 dst_input(skb) 结束,而 dst_input(skb) 又会调用路由系统附加在数据包的 dst_entry 结构上的输入函数指针。
If the packet’s final destination is the local system, the routing system will attach the function
ip_local_deliver
to the input function pointer in the dst_entry
structure on the packet.如果数据包的最终目的地是本地系统,路由系统将在数据包的 dst_entry 结构中将函数 ip_local_deliver 附加到输入函数指针上。
Tuning: adjusting IP protocol early demux(调优:调整IP协议早期解复用)
Disable the
early_demux
optimization by setting a sysctl
.通过设置 sysctl 来禁用早期解复用优化。
$ sudo sysctl -w net.ipv4.ip_early_demux=0
The default value is 1;
early_demux
is enabled.This sysctl was added as some users saw a ~5% decrease in throughput with the
early_demux
optimization in some cases.此 sysctl 被添加,因为一些用户在某些情况下发现早期的 demux 优化导致吞吐量下降了约 5%。
ip_local_deliver
Recall how we saw the following pattern in the IP protocol layer:
回想一下,我们在 IP 协议层中看到的以下模式:
- Calls to
ip_rcv
do some initial bookkeeping.
- Packet is handed off to netfilter for processing, with a pointer to a callback to be executed when processing finishes.
ip_rcv_finish
is the callback which finished processing and continued working toward pushing the packet up the networking stack.
- 对 ip_rcv 的调用会进行一些初始的账务处理。
- 数据包被传递给 netfilter 进行处理,并附带一个指向回调函数的指针,该回调函数将在处理完成后执行。
- ip_rcv_finish 是完成处理并继续将数据包向上推送到网络协议栈的回调函数。
ip_local_deliver
has the same pattern. From net/ipv4/ip_input.c:ip_local_deliver 具有相同的模式。来自 net/ipv4/ip_input.c:
/* * Deliver IP Packets to the higher protocol layers. */ int ip_local_deliver(struct sk_buff *skb) { /* * Reassemble IP fragments. */ if (ip_is_fragment(ip_hdr(skb))) { if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER)) return 0; } return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL, ip_local_deliver_finish); }
Once netfilter has had a chance to take a look at the data,
ip_local_deliver_finish
will be called, assuming the data is not dropped first by netfilter.一旦 netfilter 有机会查看数据,
ip_local_deliver_finish
将被调用,假设数据未被 netfilter 首先丢弃。ip_local_deliver_finish
ip_local_deliver_finish
obtains the protocol from the packet, looks up a net_protocol
structure registered for that protocol, and calls the function pointed to by handler
in the net_protocol
structure.ip_local_deliver_finish
从数据包中获取协议,查找为该协议注册的 net_protocol
结构,并调用 net_protocol
结构中 handler 指向的函数。This hands the packet up to the higher level protocol layer.
这将数据包传递给更高级别的协议层。
Monitoring: IP protocol layer statistics
Monitor detailed IP protocol statistics by reading
/proc/net/snmp
.通过读取 /proc/net/snmp 来监控详细的 IP 协议统计信息。
$ cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates Ip: 1 64 25922988125 0 0 15771700 0 0 25898327616 22789396404 12987882 51 1 10129840 2196520 1 0 0 0 ...
This file contains statistics for several protocol layers. The IP protocol layer appears first. The first line contains space-separated names for each of the corresponding values in the next line.
此文件包含多个协议层的统计信息。IP 协议层首先出现。第一行包含空格分隔的名称,对应下一行中的每个值。
In the IP protocol layer, you will find statistics counters being bumped. Those counters are referenced by a C enum. All of the valid enum values and the field names they correspond to in
/proc/net/snmp
can be found in include/uapi/linux/snmp.h:在 IP 协议层中,您会发现统计计数器被增加。这些计数器由 C 枚举引用。所有有效的枚举值及其在 /proc/net/snmp 中对应的字段名称可在 include/uapi/linux/snmp.h 中找到:
enum { IPSTATS_MIB_NUM = 0, /* frequently written fields in fast path, kept in same cache line */ IPSTATS_MIB_INPKTS, /* InReceives */ IPSTATS_MIB_INOCTETS, /* InOctets */ IPSTATS_MIB_INDELIVERS, /* InDelivers */ IPSTATS_MIB_OUTFORWDATAGRAMS, /* OutForwDatagrams */ IPSTATS_MIB_OUTPKTS, /* OutRequests */ IPSTATS_MIB_OUTOCTETS, /* OutOctets */ /* ... */
Monitor extended IP protocol statistics by reading
/proc/net/netstat
.通过读取 /proc/net/netstat 监控扩展的 IP 协议统计信息。
$ cat /proc/net/netstat | grep IpExt IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT0Pkts InCEPkts IpExt: 0 0 0 0 277959 0 14568040307695 32991309088496 0 0 58649349 0 0 0 0 0
The format is similar to
/proc/net/snmp
, except the lines are prefixed with IpExt
.格式与 /proc/net/snmp 类似,但行前缀为 IpExt。
Some interesting statistics:
一些有趣的统计信息:
InReceives
: The total number of IP packets that reachedip_rcv
before any data integrity checks.- InReceives:在进行任何数据完整性检查之前到达 ip_rcv 的 IP 数据包总数。
InHdrErrors
: Total number of IP packets with corrupted headers. The header was too short, too long, non-existent, had the wrong IP protocol version number, etc.- InHdrErrors:包含损坏报头的 IP 数据包总数。报头太短、太长、不存在、IP 协议版本号错误等。
InAddrErrors
: Total number of IP packets where the host was unreachable.- InAddrErrors:主机无法访问的 IP 数据包总数。
ForwDatagrams
: Total number of IP packets that have been forwarded.- ForwDatagrams:已转发的 IP 数据包总数。
InUnknownProtos
: Total number of IP packets with unknown or unsupported protocol specified in the header.- InUnknownProtos:报头中指定的未知或不支持协议的 IP 数据包总数。
InDiscards
: Total number of IP packets discarded due to memory allocation failure or checksum failure when packets are trimmed.- InDiscards:由于内存分配失败或数据包修剪时校验和失败而被丢弃的 IP 数据包总数。
InDelivers
: Total number of IP packets successfully delivered to higher protocol layers. Keep in mind that those protocol layers may drop data even if the IP layer does not.- InDelivers:成功传递到更高协议层的 IP 数据包总数。请记住,即使 IP 层没有丢弃数据,这些协议层也可能丢弃数据。
InCsumErrors
: Total number of IP Packets with checksum errors.- InCsumErrors:校验和错误的 IP 数据包总数。
Note that each of these is incremented in really specific locations in the IP layer. Code gets moved around from time to time and double counting errors or other accounting bugs can sneak in. If these statistics are important to you, you are strongly encouraged to read the IP protocol layer source code for the metrics that are important to you so you understand when they are (and are not) being incremented.
请注意,这些计数器是在 IP 层非常特定的位置递增的。代码会不时移动,可能会出现重复计数错误或其他计数错误。如果这些统计信息对您很重要,强烈建议您阅读 IP 协议层源代码,以了解对您重要的指标,并了解它们何时(以及何时不)被递增。
Higher level protocol registration
This blog post will examine UDP, but the TCP protocol handler is registered the same way and at the same time as the UDP protocol handler.
这篇博文将探讨 UDP,但 TCP 协议处理程序的注册方式与 UDP 协议处理程序相同,并且是在同一时间进行的。
In
net/ipv4/af_inet.c
, the structure definitions which contain the handler functions for connecting the UDP, TCP , and ICMP protocols to the IP protocol layer can be found. From net/ipv4/af_inet.c:在
net/ipv4/af_inet.c
中,可以找到包含 UDP、TCP 和 ICMP 协议与 IP 协议层连接的处理函数的结构定义。来自 net/ipv4/af_inet.c
:static const struct net_protocol tcp_protocol = { .early_demux = tcp_v4_early_demux, .handler = tcp_v4_rcv, .err_handler = tcp_v4_err, .no_policy = 1, .netns_ok = 1, }; static const struct net_protocol udp_protocol = { .early_demux = udp_v4_early_demux, .handler = udp_rcv, .err_handler = udp_err, .no_policy = 1, .netns_ok = 1, }; static const struct net_protocol icmp_protocol = { .handler = icmp_rcv, .err_handler = icmp_err, .no_policy = 1, .netns_ok = 1, };
These structures are registered in the initialization code of the inet address family. From net/ipv4/af_inet.c:
这些结构在 inet 地址族的初始化代码中注册。来自 net/ipv4/af_inet.c:
/* * Add all the base protocols. */ if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0) pr_crit("%s: Cannot add ICMP protocol\n", __func__); if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0) pr_crit("%s: Cannot add UDP protocol\n", __func__); if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0) pr_crit("%s: Cannot add TCP protocol\n", __func__);
We’re going to be looking at the UDP protocol layer. As seen above, the
handler
function for UDP is called udp_rcv
.我们将要研究 UDP 协议层。如上所示,UDP 的处理函数被称为
udp_rcv
。This is the entry point into the UDP layer where the IP layer hands data. Let’s continue our journey there.
这是进入UDP层的入口,IP层在这里传递数据。让我们继续我们的旅程吧。
UDP protocol layer
The code for the UDP protocol layer can be found in: net/ipv4/udp.c.
UDP协议层的代码可以在:net/ipv4/udp.c中找到。
udp_rcv
The code for the
udp_rcv
function is just a single line which calls directly into __udp4_lib_rcv
to handle receiving the datagram.udp_rcv
函数的代码只有一行,直接调用 __udp4_lib_rcv
来处理接收到的数据报。__udp4_lib_rcv
The
__udp4_lib_rcv
function will check to ensure the packet is valid and obtain the UDP header, UDP datagram length, source address, and destination address. Next, are some additional integrity checks and checksum verification.__udp4_lib_rcv 函数将检查数据包是否有效,并获取 UDP 头部、UDP 数据报长度、源地址和目的地址。接下来,进行一些额外的完整性检查和校验和验证。
Recall that earlier in the IP protocol layer, we saw that an optimization is performed to attach a
dst_entry
to the packet before it is handed off to the upper layer protocol (UDP in our case).回想一下,在前面的 IP 协议层中,我们看到在将数据包传递给上层协议(在本例中为 UDP)之前,会执行一项优化操作,即把 dst_entry 附加到数据包上。
If a socket and corresponding
dst_entry
were found, __udp4_lib_rcv
will queue the packet to the socket:如果找到了套接字和对应的
dst_entry
,__udp4_lib_rcv
将把数据包排队到该套接字:sk = skb_steal_sock(skb); if (sk) { struct dst_entry *dst = skb_dst(skb); int ret; if (unlikely(sk->sk_rx_dst != dst)) udp_sk_rx_dst_set(sk, dst); ret = udp_queue_rcv_skb(sk, skb); sock_put(sk); /* a return value > 0 means to resubmit the input, but * it wants the return to be -protocol, or 0 */ if (ret > 0) return -ret; return 0; } else {
If there is no socket attached from the early_demux operation, a receiving socket will now be looked up by calling
__udp4_lib_lookup_skb
.In both cases described above, the datagram will be queued to the socket:
如果在 early_demux 操作中没有附加套接字,现在将通过调用 __udp4_lib_lookup_skb 查找接收套接字。在上述两种情况下,数据报都将被排队到套接字:
ret = udp_queue_rcv_skb(sk, skb); sock_put(sk);
If no socket was found, the datagram will be dropped:
如果没有找到套接字,数据报将被丢弃:
/* No socket. Drop packet silently, if checksum is wrong */ if (udp_lib_checksum_complete(skb)) goto csum_error; UDP_INC_STATS_BH(net, UDP_MIB_NOPORTS, proto == IPPROTO_UDPLITE); icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0); /* * Hmm. We got an UDP packet to a port to which we * don't wanna listen. Ignore it. */ kfree_skb(skb); return 0;
udp_queue_rcv_skb
The initial parts of this function are as follows:
该函数的初始部分如下:
- Determine if the socket associated with the datagram is an encapsulation socket. If so, pass the packet up to that layer’s handler function before proceeding.
- 确定与数据报关联的套接字是否为封装套接字。如果是,则在继续之前将数据包传递给该层的处理函数。
- Determine if the datagram is a UDP-Lite datagram and do some integrity checks.
- 确定数据报是否为 UDP-Lite 数据报并执行一些完整性检查。
- Verify the UDP checksum of the datagram and drop it if the checksum fails.
- 验证数据报的 UDP 校验和,如果校验和失败则丢弃数据报。
Finally, we arrive at the receive queue logic which begins by checking if the receive queue for the socket is full. From
net/ipv4/udp.c
:最后,我们进入接收队列逻辑,首先检查套接字的接收队列是否已满。来自 net/ipv4/udp.c:
if (sk_rcvqueues_full(sk, skb, sk->sk_rcvbuf)) goto drop;
sk_rcvqueues_full
The
sk_rcvqueues_full
function checks the socket’s backlog length and the socket’s sk_rmem_alloc
to determine if the sum is greater than the sk_rcvbuf
for the socket (sk->sk_rcvbuf
in the above code snippet):sk_rcvqueues_full 函数检查套接字的 backlog 长度和 sk_rmem_alloc 来确定它们之和是否大于该套接字的 sk_rcvbuf(在上面的代码片段中为 sk->sk_rcvbuf):
/* * Take into account size of receive queue and backlog queue * Do not take into account this skb truesize, * to allow even a single big packet to come. */ static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb, unsigned int limit) { unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc); return qsize > limit; }
Tuning these values is a bit tricky as there are many things that can be adjusted.
调整这些数值有点棘手,因为有很多东西可以调整。
Tuning: Socket receive queue memory(调优:套接字接收队列内存)
The
sk->sk_rcvbuf
(called limit in sk_rcvqueues_full
above) value can be increased to whatever the sysctl net.core.rmem_max
is set to.Increase the maximum receive buffer size by setting a
sysctl
.sk->sk_rcvbuf(在上面的sk_rcvqueues_full中称为limit)值可以增加到sysctl net.core.rmem_max设置的任意值。通过设置sysctl来增加最大接收缓冲区大小。
$ sudo sysctl -w net.core.rmem_max=8388608
sk->sk_rcvbuf
starts at the net.core.rmem_default
value, which can also be adjusted by setting a sysctl, like so:Adjust the default initial receive buffer size by setting a
sysctl
.sk->sk_rcvbuf 从 net.core.rmem_default 值开始,也可以通过设置 sysctl 来调整,例如:通过设置 sysctl 调整默认初始接收缓冲区大小。
$ sudo sysctl -w net.core.rmem_default=8388608
You can also set the
sk->sk_rcvbuf
size by calling setsockopt
from your application and passing SO_RCVBUF
. The maximum you can set with setsockopt
is net.core.rmem_max
.您也可以通过调用应用程序中的 setsockopt 并传递 SO_RCVBUF 来设置 sk->sk_rcvbuf 大小。您可以通过 setsockopt 设置的最大值是 net.core.rmem_max 。
However, you can override the
net.core.rmem_max
limit by calling setsockopt
and passing SO_RCVBUFFORCE
, but the user running the application will need the CAP_NET_ADMIN
capability.但是,您可以通过调用 setsockopt 并传递 SO_RCVBUFFORCE 来覆盖 net.core.rmem_max 限制,但运行应用程序的用户需要 CAP_NET_ADMIN 能力。
The
sk->sk_rmem_alloc
value is incremented by calls to skb_set_owner_r
which set the owner socket of a datagram. We’ll see this called later in the UDP layer.当调用 skb_set_owner_r 设置数据报的所有者套接字时,sk->sk_rmem_alloc 值会增加。我们将在 UDP 层中看到这一点。
The
sk->sk_backlog.len
is incremented by calls to sk_add_backlog
, which we’ll see next.当调用 sk_add_backlog 时,sk->sk_backlog.len 的值会增加,我们将在接下来看到这一点。
udp_queue_rcv_skb
Once it’s been verified that the queue is not full, progress toward queuing the datagram can continue. From net/ipv4/udp.c:
一旦确认队列未满,就可以继续将数据报入队。来自 net/ipv4/udp.c:
bh_lock_sock(sk); if (!sock_owned_by_user(sk)) rc = __udp_queue_rcv_skb(sk, skb); else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) { bh_unlock_sock(sk); goto drop; } bh_unlock_sock(sk); return rc;
The first step is to determine if the socket currently has any system calls against it from a userland program. If it does not, the datagram can be added to the receive queue with a call to
__udp_queue_rcv_skb
. If it does, the datagram is queued to the backlog with a call to sk_add_backlog
.第一步是确定套接字当前是否正在接收来自用户空间程序的任何系统调用。如果没有,可以通过调用__udp_queue_rcv_skb将数据报添加到接收队列。如果有的话,则通过调用sk_add_backlog将其排队到后台队列。
The datagrams on the backlog are added to the receive queue when socket system calls release the socket with a call to
release_sock
in the kernel.后台队列上的数据报在套接字系统调用通过内核中的release_sock释放套接字时被添加到接收队列。
__udp_queue_rcv_skb
The
__udp_queue_rcv_skb
function adds datagrams to the receive queue by calling sock_queue_rcv_skb
and bumps statistics counters if the datagram could not be added to the receive queue for the socket.__udp_queue_rcv_skb 函数通过调用 sock_queue_rcv_skb 将数据报添加到接收队列,并在无法将数据报添加到套接字的接收队列时增加统计计数器。
From net/ipv4/udp.c:
rc = sock_queue_rcv_skb(sk, skb); if (rc < 0) { int is_udplite = IS_UDPLITE(sk); /* Note that an ENOMEM error is charged twice */ if (rc == -ENOMEM) UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS,is_udplite); UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite); kfree_skb(skb); trace_udp_fail_queue_rcv_skb(rc, sk); return -1; }
Monitoring: UDP protocol layer statistics
Two very useful files for getting UDP protocol statistics are:
获取UDP协议统计信息的两个非常有用的文件是:
/proc/net/snmp
/proc/net/udp
/proc/net/snmp
Monitor detailed UDP protocol statistics by reading
/proc/net/snmp
.通过读取 /proc/net/snmp 监控详细的 UDP 协议统计信息。
$ cat /proc/net/snmp | grep Udp\: Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors Udp: 16314 0 0 17161 0 0
Much like the detailed statistics found in this file for the IP protocol, you will need to read the protocol layer source to determine exactly when and where these values are incremented.
与该文件中针对IP协议的详细统计信息类似,你需要阅读协议层源代码来确定这些值在何时何地被精确地增加。
InDatagrams
: Incremented whenrecvmsg
was used by a userland program to read a datagram. Also incremented when a UDP packet is encapsulated and sent back for processing.- InDatagrams:当用户程序使用recvmsg读取数据报时增加。当UDP数据包被封装并返回处理时也会增加。
NoPorts
: Incremented when UDP packets arrive destined for a port where no program is listening.- NoPorts:当UDP数据包到达一个没有程序监听的端口时增加。
InErrors
: Incremented in several cases: no memory in the receive queue, when a bad checksum is seen, and ifsk_add_backlog
fails to add the datagram.- InErrors:在几种情况下增加:接收队列内存不足、看到坏校验和以及sk_add_backlog无法添加数据报时。
OutDatagrams
: Incremented when a UDP packet is handed down without error to the IP protocol layer to be sent.- OutDatagrams:当UDP数据包无误地传递到IP协议层以发送时增加。
RcvbufErrors
: Incremented whensock_queue_rcv_skb
reports that no memory is available; this happens ifsk->sk_rmem_alloc
is greater than or equal tosk->sk_rcvbuf
.- RcvbufErrors:当sock_queue_rcv_skb报告内存不足时增加;这种情况发生在sk->sk_rmem_alloc大于或等于sk->sk_rcvbuf时。
SndbufErrors
: Incremented if the IP protocol layer reported an error when trying to send the packet and no error queue has been setup. Also incremented if no send queue space or kernel memory are available.- SndbufErrors:如果IP协议层在尝试发送数据包时报告错误且未设置错误队列时增加。如果没有发送队列空间或内核内存可用时也会增加。
-
InCsumErrors
: Incremented when a UDP checksum failure is detected. Note that in all cases I could find,InCsumErrors
is incremented at the same time as InErrors. Thus, InErrors - InCsumErrors
should yield the count of memory related errors on the receive side. - InCsumErrors:当检测到UDP校验和失败时增加。需要注意的是,在我找到的所有情况下,InCsumErrors都会与InErrors同时增加。因此,InErrors - InCsumErrors应该给出接收端内存相关错误的数量。
/proc/net/udp
Monitor UDP socket statistics by reading
/proc/net/udp
$ cat /proc/net/udp sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops 515: 00000000:B346 00000000:0000 07 00000000:00000000 00:00000000 00000000 104 0 7518 2 0000000000000000 0 558: 00000000:0371 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7408 2 0000000000000000 0 588: 0100007F:038F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7511 2 0000000000000000 0 769: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7673 2 0000000000000000 0 812: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7407 2 0000000000000000 0
The first line describes each of the fields in the lines following:
第一行描述了后续各行中的每个字段:
sl
: Kernel hash slot for the socket- sl:套接字的内核哈希槽
local_address
: Hexadecimal local address of the socket and port number, separated by:
.- 本地地址:套接字的十六进制本地地址和端口号,以冒号分隔。
rem_address
: Hexadecimal remote address of the socket and port number, separated by:
.- 远程地址:套接字的十六进制远程地址和端口号,以冒号分隔。
st
: The state of the socket. Oddly enough, the UDP protocol layer seems to use some TCP socket states. In the example above,7
isTCP_CLOSE
.- 状态:套接字的状态。奇怪的是,UDP协议层似乎使用了一些TCP套接字的状态。在上面的例子中,7表示TCP_CLOSE。
tx_queue
: The amount of memory allocated in the kernel for outgoing UDP datagrams.- 发送队列:内核为传出UDP数据报分配的内存大小。
rx_queue
: The amount of memory allocated in the kernel for incoming UDP datagrams.- 接收队列:内核为传入UDP数据报分配的内存大小。
tr
,tm->when
,retrnsmt
: These fields are unused by the UDP protocol layer.- tr、tm->when、retrnsmt:这些字段未被UDP协议层使用。
uid
: The effective user id of the user who created this socket.- 用户ID:创建此套接字的用户的有效用户ID。
timeout
: Unused by the UDP protocol layer.- 超时:UDP协议层未使用。
inode
: The inode number corresponding to this socket. You can use this to help you determine which user process has this socket open. Check/proc/[pid]/fd
, which will contain symlinks to socket[:inode] (see the sketch after this list).
- 索引节点:与此套接字对应的索引节点号。您可以使用它来帮助您确定哪个用户进程打开了此套接字。检查 /proc/[pid]/fd,其中包含指向 socket[:inode] 的符号链接(参见本列表后的示例)。
ref
: The current reference count for the socket.- 引用计数:套接字当前的引用计数。
pointer
: The memory address in the kernel of thestruct sock
.- 指针:内核中struct sock的内存地址。
drops
: The number of datagram drops associated with this socket. Note that this does not include any drops related to sending datagrams (on corked UDP sockets or otherwise); this is only incremented in receive paths as of the kernel version examined by this blog post.- 丢弃:与此套接字关联的数据报丢弃数量。请注意,这不包括与发送数据报相关的任何丢弃(无论是通过corked UDP套接字还是其他方式);仅在接收路径中增加,截至本博客文章所检查的内核版本。
The code which outputs this can be found in
net/ipv4/udp.c
.可在 net/ipv4/udp.c 中找到输出此内容的代码。
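To make the inode-to-process mapping mentioned in the list above concrete, here is a hedged sketch that scans /proc/[pid]/fd for a socket:[inode] symlink. It is illustrative only; tools like ss and lsof perform the same lookup for you.
为了更直观地说明上面列表中提到的 inode 与进程的对应关系,下面是一个扫描 /proc/[pid]/fd 的示例草图(ss、lsof 等工具会替你完成同样的工作):
```c
/* Sketch: find which processes hold a socket with a given inode by scanning
 * /proc/[pid]/fd for "socket:[inode]" symlinks (what ss/lsof do for you). */
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <unistd.h>
#include <ctype.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <socket-inode>\n", argv[0]);
        return 1;
    }

    char wanted[64];
    snprintf(wanted, sizeof(wanted), "socket:[%s]", argv[1]);

    DIR *proc = opendir("/proc");
    struct dirent *pid_ent;
    while (proc && (pid_ent = readdir(proc))) {
        if (!isdigit((unsigned char)pid_ent->d_name[0]))
            continue;   /* only numeric directories are PIDs */

        char fd_dir[300];
        snprintf(fd_dir, sizeof(fd_dir), "/proc/%s/fd", pid_ent->d_name);

        DIR *fds = opendir(fd_dir);
        struct dirent *fd_ent;
        while (fds && (fd_ent = readdir(fds))) {
            char link_path[600], target[128];
            snprintf(link_path, sizeof(link_path), "%s/%s", fd_dir, fd_ent->d_name);
            ssize_t n = readlink(link_path, target, sizeof(target) - 1);
            if (n < 0)
                continue;
            target[n] = '\0';
            if (strcmp(target, wanted) == 0)
                printf("pid %s has fd %s open on %s\n",
                       pid_ent->d_name, fd_ent->d_name, wanted);
        }
        if (fds)
            closedir(fds);
    }
    if (proc)
        closedir(proc);
    return 0;
}
```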
Queuing data to a socket(将数据排队到套接字)
Network data is queued to a socket with a call to
sock_queue_rcv_skb
. This function does a few things before adding the datagram to the queue:通过调用 sock_queue_rcv_skb 将网络数据排队到套接字。在将数据报添加到队列之前,此函数会执行以下操作:
- The socket’s allocated memory is checked to determine if it has exceeded the receive buffer size. If so, the drop count for the socket is incremented.
- 检查套接字分配的内存是否已超过接收缓冲区大小。如果是,则增加该套接字的丢弃计数。
- Next
sk_filter
is used to process any Berkeley Packet Filter filters that have been applied to the socket. - 接下来使用 sk_filter 处理已应用于该套接字的 Berkeley 数据包过滤器。
sk_rmem_schedule
is run to ensure sufficient receive buffer space exists to accept this datagram.- 运行 sk_rmem_schedule,以确保有足够的接收缓冲区空间来接受此数据报。
- Next the size of the datagram is charged to the socket with a call to
skb_set_owner_r
. This incrementssk->sk_rmem_alloc
. - 然后通过调用 skb_set_owner_r 向该套接字收取数据报的大小。这会增加 sk->sk_rmem_alloc。
- The data is added to the queue with a call to
__skb_queue_tail
. - 通过调用 __skb_queue_tail 将数据添加到队列中。
- Finally, any processes waiting on data to arrive in the socket are notified with a call to the
sk_data_ready
notification handler function. - 最后,通过调用 sk_data_ready 通知处理程序函数,通知任何正在等待数据到达该套接字的进程。
And that is how data arrives at a system and traverses the network stack until it reaches a socket and is ready to be read by a user program.
这就是数据如何到达系统并遍历网络堆栈,直到它到达套接字并准备好被用户程序读取。
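For completeness, a minimal userland UDP receiver that drains the socket receive queue described above with recvfrom; the port number (12345) is just an arbitrary choice for the sketch.
为完整起见,下面给出一个最简化的用户态 UDP 接收程序草图(端口号 12345 仅为示例):
```c
/* Minimal UDP receiver: reads datagrams that the kernel queued to the
 * socket's receive queue via the receive path described above. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(12345);   /* arbitrary example port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    char buf[2048];
    for (;;) {
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0) {
            perror("recvfrom");
            break;
        }
        printf("received %zd byte datagram\n", n);
    }
    return 0;
}
```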
Extras
There are a few extra things worth mentioning which didn’t seem quite right anywhere else.
还有一些额外的事项值得一提,这些事项在其他地方似乎并不合适。
Timestamping
As mentioned in the above blog post, the networking stack can collect timestamps of incoming data. There are sysctl values controlling when/how to collect timestamps when used in conjunction with RPS; see the above post for more information on RPS, timestamping, and where, exactly, in the network stack receive timestamping happens. Some NICs even support timestamping in hardware, too.
正如上述博客文章中所述,网络堆栈可以收集传入数据的时间戳。有一些sysctl值控制何时/如何收集时间戳,当与RPS结合使用时;有关RPS、时间戳以及网络堆栈中接收时间戳的确切位置的更多信息,请参阅上述文章。一些网卡甚至支持硬件时间戳功能。
This is a useful feature if you’d like to try to determine how much latency the kernel network stack is adding to receiving packets.
如果您想确定内核网络堆栈在接收数据包时增加了多少延迟,这是一个有用的功能。
The kernel documentation about timestamping is excellent and there is even an included sample program and Makefile you can check out!
Determine which timestamp modes your driver and device support with
ethtool -T
.关于时间戳的内核文档非常出色,甚至还包含了一个示例程序和Makefile供您查看!使用ethtool -T确定您的驱动程序和设备支持哪些时间戳模式。
$ sudo ethtool -T eth0 Time stamping parameters for eth0: Capabilities: software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE) software-receive (SOF_TIMESTAMPING_RX_SOFTWARE) software-system-clock (SOF_TIMESTAMPING_SOFTWARE) PTP Hardware Clock: none Hardware Transmit Timestamp Modes: none Hardware Receive Filter Modes: none
This NIC, unfortunately, does not support hardware receive timestamping, but software timestamping can still be used on this system to help me determine how much latency the kernel is adding to my packet receive path.
这个网卡不幸的是不支持硬件接收时间戳,但在这个系统上仍然可以使用软件时间戳,以帮助我确定内核对数据包接收路径增加了多少延迟。
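As a hedged illustration of software receive timestamping from userland (one of the software-receive capabilities reported by ethtool -T above), a socket can request a per-datagram timestamp with SO_TIMESTAMPNS and read it back from the SCM_TIMESTAMPNS control message; the port number is again an arbitrary example.
下面是一个从用户态使用软件接收时间戳的示例草图:通过 SO_TIMESTAMPNS 请求每个数据报的时间戳,并从 SCM_TIMESTAMPNS 控制消息中读取(端口号仅为示例):
```c
/* Sketch: per-datagram software RX timestamps via SO_TIMESTAMPNS.
 * The kernel attaches a struct timespec as an SCM_TIMESTAMPNS cmsg. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(12345);   /* arbitrary example port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPNS, &on, sizeof(on));

    char data[2048], control[512];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = control, .msg_controllen = sizeof(control) };

    ssize_t n = recvmsg(fd, &msg, 0);
    if (n < 0) {
        perror("recvmsg");
        return 1;
    }

    for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        if (cm->cmsg_level == SOL_SOCKET && cm->cmsg_type == SCM_TIMESTAMPNS) {
            struct timespec ts;
            memcpy(&ts, CMSG_DATA(cm), sizeof(ts));
            printf("%zd bytes received at %ld.%09ld\n", n, ts.tv_sec, ts.tv_nsec);
        }
    }
    return 0;
}
```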
Busy polling for low latency sockets
It is possible to use a socket option called
SO_BUSY_POLL
which will cause the kernel to busy poll for new data when a blocking receive is done and there is no data.可以使用名为 SO_BUSY_POLL 的套接字选项,当执行阻塞接收操作且没有数据时,内核会忙式轮询新数据。
IMPORTANT NOTE: For this option to work, your device driver must support it. Linux kernel 3.13.0’s
igb
driver does not support this option. The ixgbe
driver, however, does. If your driver has a function set to the ndo_busy_poll
field of its struct net_device_ops
structure (mentioned in the above blog post), it supports SO_BUSY_POLL
.重要提示:要使此选项生效,您的设备驱动程序必须支持它。Linux内核3.13.0的igb驱动程序不支持此选项,但ixgbe驱动程序支持。如果您的驱动程序已将函数设置为struct net_device_ops结构体的ndo_busy_poll字段(如上文博客中所述),则它支持SO_BUSY_POLL。
A great paper explaining how this works and how to use it is available from Intel.
Intel提供了一篇很好的论文,解释了其工作原理及使用方法。
When using this socket option for a single socket, you should pass a time value in microseconds as the amount of time to busy poll in the device driver’s receive queue for new data. When you issue a blocking read to this socket after setting this value, the kernel will busy poll for new data.
当对单个套接字使用此套接字选项时,您应以微秒为单位传递一个时间值,作为在设备驱动程序接收队列中忙轮询新数据的时间量。在设置该值后,当您对此套接字发出阻塞读取时,内核将忙轮询新数据。
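A hedged sketch of setting this per-socket option follows, assuming your driver supports busy polling (see the note above); the 50 microsecond value is only an example and should be tuned for your workload.
下面是为单个套接字设置该选项的示例草图(假设驱动支持忙轮询;50 微秒仅为示例值):
```c
/* Sketch: enable busy polling on one socket. The value is the number of
 * microseconds the kernel will busy poll the driver's RX queue on a
 * blocking read before sleeping. Requires driver support (see above). */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    int busy_poll_usecs = 50;   /* example value; tune for your workload */
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_poll_usecs, sizeof(busy_poll_usecs)) < 0)
        perror("setsockopt(SO_BUSY_POLL)");   /* raising the value may need CAP_NET_ADMIN */

    /* ... blocking receives on fd will now busy poll for up to 50 usecs ... */
    return 0;
}
```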
You can also set the sysctl value
net.core.busy_poll
to a time value in microseconds of how long calls with poll
or select
should busy poll waiting for new data to arrive, as well.This option can reduce latency, but will increase CPU usage and power consumption.
您也可以将sysctl值net.core.busy_poll设置为以微秒为单位的时间值,表示poll或select调用在等待新数据到达时应忙轮询多长时间。此选项可降低延迟,但会增加CPU使用率和功耗。
Netpoll: support for networking in critical contexts
The Linux kernel provides a way for device drivers to be used to send and receive data on a NIC when the kernel has crashed. The API for this is called Netpoll and it is used by a few things, but most notably: kgdb, netconsole.
Linux 内核提供了一种方法,当内核崩溃时,设备驱动程序可用于在网卡(NIC)上发送和接收数据。用于此目的的 API 称为 Netpoll,它被一些功能使用,其中最著名的是:kgdb 和 netconsole。
Most drivers support Netpoll; your driver needs to implement the
ndo_poll_controller
function and attach it to the struct net_device_ops
that is registered during probe (as seen above).大多数驱动程序都支持 Netpoll;您的驱动程序需要实现 ndo_poll_controller 函数,并将其附加到在探测期间注册的 struct net_device_ops(如上所示)。
When the networking device subsystem performs operations on incoming or outgoing data, the netpoll system is checked first to determine if the packet is destined for netpoll.
当网络设备子系统对传入或传出数据执行操作时,首先检查 netpoll 系统,以确定数据包是否发往 netpoll。
For example, we can see the following code in
__netif_receive_skb_core
from net/core/dev.c
:例如,我们可以在来自 net/core/dev.c 的 __netif_receive_skb_core 中看到以下代码:
static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc) { /* ... */ /* if we've gotten here through NAPI, check netpoll */ if (netpoll_receive_skb(skb)) goto out; /* ... */ }
The Netpoll checks happen early in most of the Linux network device subsystem code that deals with transmitting or receiving network data.
Netpoll 检查发生在大多数 Linux 网络设备子系统代码中,这些代码处理网络数据的发送或接收。
Consumers of the Netpoll API can register
struct netpoll
structures by calling netpoll_setup
. The struct netpoll
structure has function pointers for attaching receive hooks, and the API exports a function for sending data.使用 Netpoll API 的用户可以通过调用 netpoll_setup 注册 struct netpoll 结构体。struct netpoll 结构体具有用于附加接收钩子的函数指针,并且该 API 导出了一个用于发送数据的函数。
If you are interested in using the Netpoll API, you should take a look at the
netconsole
driver, the Netpoll API header file include/linux/netpoll.h, and this excellent talk.如果你有兴趣使用 Netpoll API,你应该查看一下 netconsole 驱动程序、Netpoll API 头文件 include/linux/netpoll.h 以及这篇精彩的演讲。
SO_INCOMING_CPU
The
SO_INCOMING_CPU
flag was not added until Linux 3.19, but it is useful enough that it should be included in this blog post.SO_INCOMING_CPU 标志直到 Linux 3.19 才被添加,但它已经足够有用,因此应该包含在本博客文章中。
You can use
getsockopt
with the SO_INCOMING_CPU
option to determine which CPU is processing network packets for a particular socket. Your application can then use this information to hand sockets off to threads running on the desired CPU to help increase data locality and CPU cache hits.您可以使用 getsockopt 并设置 SO_INCOMING_CPU 选项,以确定哪个 CPU 正在处理特定套接字的网络数据包。然后,您的应用程序可以利用此信息将套接字分配给运行在所需 CPU 上的线程,以帮助提高数据局部性和 CPU 缓存命中率。
The mailing list message introducing
SO_INCOMING_CPU
provides a short example architecture where this option is useful.介绍 SO_INCOMING_CPU 的邮件列表消息提供了一个简短的示例架构,说明了该选项的用途。
DMA Engines
A DMA engine is a piece of hardware that allows the CPU to offload large copy operations. This frees the CPU to do other tasks while memory copies are done with hardware. Enabling the use of a DMA engine and running code that takes advantage of it, should yield reduced CPU usage.
DMA引擎是一种硬件,允许CPU卸载大型复制操作。这使CPU在内存复制由硬件完成时可以执行其他任务。启用DMA引擎并运行利用它的代码,应能降低CPU使用率。
The Linux kernel has a generic DMA engine interface that DMA engine driver authors can plug into. You can read more about the Linux DMA engine interface in the kernel source Documentation.
Linux内核具有一个通用的DMA引擎接口,DMA引擎驱动程序作者可以将其插入其中。您可以在内核源码文档中了解更多关于Linux DMA引擎接口的信息。
While there are a few DMA engines that the kernel supports, we’re going to discuss one in particular that is quite common: the Intel IOAT DMA engine.
虽然内核支持一些 DMA 引擎,但我们将特别讨论一种非常常见的引擎:英特尔 IOAT DMA 引擎。
Intel’s I/O Acceleration Technology (IOAT)
Many servers include the Intel I/O AT bundle, which is comprised of a series of performance changes.
许多服务器包含英特尔I/O AT软件包,该软件包由一系列性能改进组成。
One of those changes is the inclusion of a hardware DMA engine. You can check your
dmesg
output for ioatdma
to determine if the module is being loaded and if it has found supported hardware.其中一项改进是引入了硬件DMA引擎。您可以通过检查dmesg输出中的ioatdma来确定模块是否已加载以及是否已找到支持的硬件。
The DMA offload engine is used in a few places, most notably in the TCP stack.
DMA卸载引擎在几个地方使用,最显著的是TCP堆栈。
Support for the Intel IOAT DMA engine was included in Linux 2.6.18, but was disabled later in 3.13.11.10 due to some unfortunate data corruption bugs.
Linux 2.6.18中包含了对英特尔IOAT DMA引擎的支持,但由于一些令人遗憾的数据损坏错误,在3.13.11.10中被禁用。
Users on kernels before 3.13.11.10 may be using the
ioatdma
module on their servers by default. Perhaps this will be fixed in future kernel releases.在3.13.11.10之前的内核版本上,用户可能默认在他们的服务器上使用ioatdma模块。也许这将在未来的内核版本中得到修复。
Direct cache access (DCA)
Another interesting feature included with the Intel I/O AT bundle is Direct Cache Access (DCA).
另一个包含在英特尔I/O AT软件包中的有趣特性是直接缓存访问(DCA)。
This feature allows network devices (via their drivers) to place network data directly in the CPU cache. How this works, exactly, is driver specific. For the
igb
driver, you can check the code for the function igb_update_dca
, as well as the code for igb_update_rx_dca
. The igb
driver uses DCA by writing a register value to the NIC.该特性允许网络设备(通过其驱动程序)将网络数据直接放置到CPU缓存中。具体如何实现,取决于驱动程序的特定实现。对于igb驱动程序,你可以查看函数igb_update_dca的代码以及igb_update_rx_dca的代码。igb驱动程序通过向网卡写入寄存器值来使用DCA。
To use DCA, you will need to ensure that DCA is enabled in your BIOS, the
dca
module is loaded, and that your network card and driver both support DCA.要使用DCA,你需要确保在BIOS中启用DCA,加载dca模块,并且你的网卡和驱动程序都支持DCA。
Monitoring IOAT DMA engine
If you are using the
ioatdma
module despite the risk of data corruption mentioned above, you can monitor it by examining some entries in sysfs
.如果尽管存在上述数据损坏风险,您仍在使用 ioatdma 模块,可以通过检查 sysfs 中的一些条目来监控它。
Monitor the total number of offloaded
memcpy
operations for a DMA channel.监控 DMA 通道的总 offloaded memcpy 操作数。
$ cat /sys/class/dma/dma0chan0/memcpy_count 123205655
Similarly, to get the number of bytes offloaded by this DMA channel, you’d run a command like:
同样,要获取此 DMA 通道卸载的字节数,您可以运行类似以下命令:
Monitor total number of bytes transferred for a DMA channel.
监控 DMA 通道传输的总字节数。
$ cat /sys/class/dma/dma0chan0/bytes_transferred 131791916307
Tuning IOAT DMA engine
The IOAT DMA engine is only used when packet size is above a certain threshold. That threshold is called the
copybreak
. This check is in place because for small copies, the overhead of setting up and using the DMA engine is not worth the accelerated transfer.Adjust the DMA engine copybreak with a
sysctl
.IOAT DMA 引擎仅在数据包大小超过一定阈值时使用。该阈值称为 copybreak 。之所以设置此检查,是因为对于小规模复制操作,设置和使用 DMA 引擎的开销并不值得加速传输。通过 sysctl 调整 DMA 引擎的 copybreak。
$ sudo sysctl -w net.ipv4.tcp_dma_copybreak=2048
The default value is 4096.
Conclusion
The Linux networking stack is complicated.
Linux 网络栈很复杂。
It is impossible to monitor or tune it (or any other complex piece of software) without understanding at a deep level exactly what’s going on. Often, out in the wild of the Internet, you may stumble across a sample
sysctl.conf
that contains a set of sysctl values that should be copied and pasted on to your computer. This is probably not the best way to optimize your networking stack.如果不深入理解其内部运行机制,就无法对其进行监控或调优(也无法对任何复杂的软件进行监控或调优)。在互联网的广阔世界中,你经常会遇到一些示例 sysctl.conf 文件,其中包含一组 sysctl 值,这些值应该被复制并粘贴到你的计算机上。但这可能并不是优化网络栈的最佳方法。
Monitoring the networking stack requires careful accounting of network data at every layer, starting with the drivers and proceeding up. That way you can determine where exactly drops and errors are occurring and then adjust settings to reduce the errors you are seeing.
There is, unfortunately, no easy way out.
要监控网络栈,需要仔细统计每一层的网络数据,从驱动程序开始逐步向上分析,从而确定错误和丢包的具体位置,并调整相关设置以减少所观察到的错误。遗憾的是,没有简单的解决办法。
- 作者:tangcuyu
- 链接:https://expoli.tech/articles/2025/04/03/Monitoring%20and%20Tuning%20the%20Linux%20Networking%20Stack%3A%20Receiving%20Data%20%7C%20Packagecloud%20Blog
- 声明:本文采用 CC BY-NC-SA 4.0 许可协议,转载请注明出处。